1. Problem Definition¶
Clearly defining the business problem or question to be solved. This ensures the project's objectives are aligned with organizational goals.
PROJECT 2
Exploratory analysis and predictive modeling of housing prices in Barcelona using KNIME, AutoML and Power BI
Objective¶
Expand the analysis and predictive modeling of housing prices in Barcelona using advanced tools: KNIME for ETL and analysis, Power BI for interactive visualization, and AutoML tools as a low-code/no-code machine learning platform. The goal is to improve the accuracy of the predictive model and provide interactive visualizations that facilitate decision-making.
Problem Definition Consolidated Notes¶
- Project for predictive modeling of housing prices in Barcelona
- Project goal is to improve the accuracy of the predictive model and provide interactive visualizations
- Data Science project will be developed following the Data Science Life Cycle (DSLC) framework
2. Data Collection¶
Gathering relevant data from various sources, such as databases, APIs, or external datasets, ensuring it supports the problem statement.
Data Description¶
- price: The listing price of the property.
- rooms: Number of rooms.
- bathroom: Number of bathrooms.
- lift: Whether the building has an elevator (also known as a lift in some regions).
- terrace: Whether the unit has a terrace.
- square_meters: Size in square meters.
- real_state: Type of property.
- neighborhood: Neighborhood of the listing.
- square_meters_price: Price per square meter.
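The field list above can double as an expected schema to validate after loading. Below is a minimal sketch, assuming the column names from the data description; the `check_schema` helper and the dtype labels are illustrative, not part of the project code:

```python
import pandas as pd

# Expected columns per the data description (dtype labels are assumptions)
EXPECTED_COLUMNS = {
    "price": "numeric",
    "rooms": "numeric",
    "bathroom": "numeric",
    "lift": "bool",
    "terrace": "bool",
    "square_meters": "numeric",
    "real_state": "category",
    "neighborhood": "category",
    "square_meters_price": "numeric",
}

def check_schema(df: pd.DataFrame) -> list:
    """Return the expected columns missing from the DataFrame."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Usage on a tiny illustrative frame mirroring the description
sample = pd.DataFrame({
    "price": [750], "rooms": [3.0], "bathroom": [1.0], "lift": [True],
    "terrace": [False], "square_meters": [60.0], "real_state": ["flat"],
    "neighborhood": ["Horta- Guinardo"], "square_meters_price": [12.5],
})
print(check_schema(sample))  # → []
```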
Importing necessary libraries¶
import pandas as pd
import numpy as np
# To help with data visualization
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
%matplotlib inline
sns.set_style('whitegrid') # set style for visualization
# To suppress warnings
import warnings
warnings.filterwarnings('ignore')
from scipy.stats import zscore
#normalizing
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures # to scale the data
# modeling
import statsmodels.api as sm # adding a constant to the independent variables
from sklearn.model_selection import train_test_split # splitting data in train and test sets
from sklearn.preprocessing import PowerTransformer # for normalization (StandardScaler already imported above)
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.neural_network import MLPRegressor
import xgboost as xgb
import lightgbm as lgb
#import catboost as catb
# CatBoost is a fast, scalable, high-performance library for gradient boosting on decision trees,
# used for ranking, classification, regression and other ML tasks.
# It could not be tested in this project due to setup issues.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
#To check multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# for validation
from sklearn.model_selection import cross_val_score, KFold, cross_validate
# Deploy
import joblib
import streamlit as st
import datetime
import os
Loading the Dataset¶
df=pd.read_csv('DATA_Barcelona_Fotocasa_HousingPrices_Augmented.csv')
- Dataset provided by the academy
Web Scraping¶
- The option of augmenting the data via web scraping was explored
- Self-learning attempts were captured in the Python notebooks "scraper_fotocasa.ipynb" and "scraper_v2.ipynb"
- The notebooks aimed to browse the Fotocasa website and collect information following the format of the dataset provided by the academy
- The programs were not completed and are not functional in any version; their development was stopped on academic recommendation
- Web scraping can raise legal and ethical considerations, especially if it involves accessing data without authorization or violating a website's terms of service.
- The academy recommended requesting permission from the web portal before continuing with development.
- The related links were read (https://www.fotocasa.es/es/politica-privacidad/p ; https://www.fotocasa.es/es/aviso-legal/cp ; https://www.fotocasa.es/es/aviso-legal/ln) and no explicit information was found regarding the authorization or prohibition of web scraping activities.
- It is noted that the absence of explicit permission does not imply consent.
- Permission was requested from the portal and a negative response was received:
Hello Carlos,
We are sorry that we cannot help you, since for privacy reasons we do not carry out this type of collaboration.
Regards,
ayuda@fotocasa.zendesk.com
Property types¶
- As the problem aims to predict housing prices in Barcelona, brief complementary information about property types in Spain is included as a reference.
- Studio (Estudio): Typically the smallest type of dwelling, a studio is a single open space that combines the living area, bedroom, and kitchen, with a separate bathroom. These are ideal for individuals or couples seeking a compact living space.
- Attic (Ático): An attic refers to a top-floor apartment, often featuring sloped ceilings and sometimes including a terrace. The size can vary, but attics are generally larger than studios and may offer unique architectural features.
- Apartment (Apartamento): In Spain, the term "apartamento" usually denotes a modest-sized dwelling, typically with one or two bedrooms. These are suitable for small families or individuals desiring separate living and sleeping areas.
- Flat (Piso): The term "piso" is commonly used to describe larger residential units, often with multiple bedrooms and ample living space. Flats are prevalent in urban areas and cater to families or individuals seeking more spacious accommodations.
Data Collection Consolidated Notes¶
- The project will consider the data provided by the academy
- Web scraping involves automatically extracting data from websites, which can be subject to legal restrictions depending on the website's policies and applicable laws.
- As the problem aims to predict housing prices in Barcelona, brief complementary information about property types in Spain is included as a reference.
3. Data Preparation¶
Cleaning, preprocessing, and organizing the data. This includes handling missing values, outliers, data transformations, and feature engineering
Data Overview¶
df.head() # preview a sample first 5 rows
| | Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 750 | 3.0 | 1.0 | True | False | 60.0 | flat | Horta- Guinardo | 12.500000 |
| 1 | 1 | 770 | 2.0 | 1.0 | True | False | 59.0 | flat | Sant Andreu | 13.050847 |
| 2 | 2 | 1300 | 1.0 | 1.0 | True | True | 30.0 | flat | Gràcia | 43.333333 |
| 3 | 3 | 2800 | 1.0 | 1.0 | True | True | 70.0 | flat | Ciutat Vella | 40.000000 |
| 4 | 4 | 720 | 2.0 | 1.0 | True | False | 44.0 | flat | Sant Andreu | 16.363636 |
df.tail() # preview a sample last 5 rows
| | Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 16371 | 16371 | 950 | 1.982 | 0.957 | True | False | 60.701 | flat | Sarria-Sant Gervasi | 13.174 |
| 16372 | 16372 | 825 | 1.086 | 0.961 | True | False | 47.224 | flat | Eixample | 14.893 |
| 16373 | 16373 | 1200 | 4.195 | 1.957 | True | False | 116.100 | flat | Les Corts | 10.746 |
| 16374 | 16374 | 1100 | 2.899 | 2.155 | False | False | 57.805 | flat | Sant Martí | NaN |
| 16375 | 16375 | 850 | 2.127 | 1.024 | True | False | 58.503 | flat | Eixample | 15.390 |
df.sample(20) # preview a sample random n rows
| | Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 5205 | 5205 | 1350 | 4.000 | 2.000 | True | False | 120.000 | flat | Sarria-Sant Gervasi | 11.250000 |
| 4992 | 4992 | 950 | 3.000 | 1.000 | True | False | 76.000 | attic | Horta- Guinardo | 12.500000 |
| 1582 | 1582 | 1650 | 4.000 | 2.000 | True | False | 135.000 | flat | Sarria-Sant Gervasi | 12.222222 |
| 7434 | 7434 | 1350 | 3.000 | 2.000 | True | False | 80.000 | flat | Eixample | 16.875000 |
| 14345 | 14345 | 2250 | 2.708 | 1.813 | True | True | 175.068 | flat | Eixample | 14.308000 |
| 8638 | 8638 | 950 | 2.193 | 0.994 | False | False | 58.671 | flat | Sarria-Sant Gervasi | 16.291000 |
| 2187 | 2187 | 725 | 1.000 | 1.000 | True | False | 70.000 | flat | Ciutat Vella | 10.357143 |
| 7954 | 7954 | 1100 | 1.000 | 1.000 | True | False | 57.000 | flat | Eixample | 19.298246 |
| 7441 | 7441 | 700 | 1.000 | 1.000 | False | False | 33.000 | flat | Ciutat Vella | 21.212121 |
| 4555 | 4555 | 1600 | 2.000 | 1.000 | False | False | 75.000 | flat | Eixample | 21.333333 |
| 1926 | 1926 | 990 | 4.000 | 1.000 | False | False | 75.000 | flat | Eixample | 13.200000 |
| 11446 | 11446 | 800 | 2.820 | 0.906 | False | False | 82.412 | flat | Sarria-Sant Gervasi | 10.680000 |
| 2321 | 2321 | 990 | 2.000 | 2.000 | True | False | 73.000 | flat | Gràcia | 13.561644 |
| 8559 | 8559 | 1350 | 2.080 | NaN | True | False | 63.588 | flat | Ciutat Vella | 20.524000 |
| 2028 | 2028 | 1175 | 4.000 | 2.000 | True | False | 90.000 | flat | Eixample | 13.055556 |
| 5207 | 5207 | 720 | 2.000 | 1.000 | False | False | 53.000 | flat | Ciutat Vella | 13.584906 |
| 2951 | 2951 | 1050 | 2.000 | 2.000 | True | False | 65.000 | flat | Sant Martí | 16.153846 |
| 5535 | 5535 | 1850 | 2.000 | 2.000 | True | False | 144.000 | apartment | Ciutat Vella | 12.847222 |
| 1374 | 1374 | 2800 | 4.000 | 3.000 | True | True | 175.000 | NaN | Sarria-Sant Gervasi | 16.000000 |
| 9272 | 9272 | 1650 | 3.228 | 2.122 | True | False | 88.468 | flat | Sants-Montjuïc | 16.974000 |
- The variable 'Unnamed: 0' represents the index and should be dropped from the data
- Target variable for modeling is "price"
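Dropping the redundant index column can be sketched as follows (on a tiny illustrative frame, not the project's actual `df`); passing `errors="ignore"` keeps the call idempotent:

```python
import pandas as pd

# Illustrative frame with the redundant index column
df = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "price": [750, 770, 1300]})

# Drop the column if present; errors="ignore" makes re-runs safe
df = df.drop(columns=["Unnamed: 0"], errors="ignore")
print(df.columns.tolist())  # → ['price']
```

Alternatively, `pd.read_csv(..., index_col=0)` avoids creating the column at load time.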
print("There are", df.shape[0], 'rows and', df.shape[1], "columns.") # number of observations and features
There are 16376 rows and 10 columns.
- There are 16376 rows and 10 columns.
- Project 1's data had 8188 rows and 10 columns; this augmented dataset doubles the row count.
df.dtypes # data types
Unnamed: 0               int64
price                    int64
rooms                  float64
bathroom               float64
lift                      bool
terrace                   bool
square_meters          float64
real_state              object
neighborhood            object
square_meters_price    float64
dtype: object
- Data types match the data description, except that 'rooms' and 'bathroom' are float where integers are expected
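If downstream steps expect integer counts, the float values could be rounded and cast once missing data is accounted for. A minimal sketch (on illustrative values, not the project's `df`) using pandas' nullable `Int64` dtype so NaNs survive the cast:

```python
import pandas as pd

# Illustrative values mimicking the augmented rooms/bathroom columns
df = pd.DataFrame({"rooms": [3.0, 1.982, None], "bathroom": [1.0, 0.957, 2.155]})

# Round to the nearest whole number, then cast to the nullable Int64 dtype
# so missing values are preserved instead of raising an error
for col in ["rooms", "bathroom"]:
    df[col] = df[col].round().astype("Int64")

print(df["bathroom"].tolist())  # → [1, 1, 2]
```

Whether rounding is appropriate here is a modeling decision; the fractional values come from the data augmentation, so keeping them as floats is also defensible.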
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           16376 non-null  int64  
 1   price                16376 non-null  int64  
 2   rooms                15966 non-null  float64
 3   bathroom             15989 non-null  float64
 4   lift                 16376 non-null  bool   
 5   terrace              16376 non-null  bool   
 6   square_meters        15968 non-null  float64
 7   real_state           15458 non-null  object 
 8   neighborhood         16376 non-null  object 
 9   square_meters_price  15937 non-null  float64
dtypes: bool(2), float64(4), int64(2), object(2)
memory usage: 1.0+ MB
- There is missing data (NaN) in multiple variables
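The per-column missing counts can be summarized directly; a minimal sketch on an illustrative frame (not the project's `df`):

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in two of three columns
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 2.0],
    "real_state": ["flat", "attic", None],
    "price": [750, 770, 1300],
})

# Count missing values per column, largest first
missing = df.isnull().sum().sort_values(ascending=False)
print(int(missing.sum()))  # → 2
```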
df.describe(include="all").T # statistical summary of the data.
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 16376.0 | NaN | NaN | NaN | 8187.5 | 4727.488339 | 0.0 | 4093.75 | 8187.5 | 12281.25 | 16375.0 |
| price | 16376.0 | NaN | NaN | NaN | 1437.04586 | 1106.831419 | 320.0 | 875.0 | 1100.0 | 1514.0 | 15000.0 |
| rooms | 15966.0 | NaN | NaN | NaN | 2.421662 | 1.13863 | 0.0 | 1.884 | 2.111 | 3.0 | 10.754 |
| bathroom | 15989.0 | NaN | NaN | NaN | 1.504682 | 0.723192 | 0.9 | 1.0 | 1.037 | 2.0 | 8.0 |
| lift | 16376 | 2 | True | 11246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| terrace | 16376 | 2 | False | 12770 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| square_meters | 15968.0 | NaN | NaN | NaN | 84.368874 | 47.486402 | 10.0 | 56.0855 | 72.748 | 95.0 | 679.0 |
| real_state | 15458 | 4 | flat | 12650 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| neighborhood | 16376 | 10 | Eixample | 4795 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| square_meters_price | 15937.0 | NaN | NaN | NaN | 17.73171 | 9.199731 | 4.549 | 12.777778 | 15.31 | 19.402 | 197.272 |
- Unit sizes range from 10 m2 to 679 m2, with a mean of 84.37 m2
- Unit prices range from 320 EUR to 15000 EUR per month, with a mean of 1437 EUR/month
- The price range is assumed to refer to monthly rent, so it is treated as EUR per month
- Prices per square meter range from 4.549 to 197.272 EUR/m2/month, with a mean of 17.73 EUR/m2/month
- There are units listed with zero rooms and with 10.754 rooms
- There are units with 0.9 bathrooms
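Two of these observations can be checked programmatically: whether `square_meters_price` really equals `price / square_meters`, and which rows have implausible room or bathroom counts. A minimal sketch on illustrative rows; the tolerance and the `< 1` thresholds are assumptions, not project rules:

```python
import numpy as np
import pandas as pd

# Illustrative rows mirroring the dataset's columns
df = pd.DataFrame({
    "price": [750, 1300],
    "square_meters": [60.0, 30.0],
    "square_meters_price": [12.5, 43.333333],
    "rooms": [0.0, 1.0],
    "bathroom": [0.9, 1.0],
})

# Recompute EUR/m2/month and compare against the stored column
recomputed = df["price"] / df["square_meters"]
consistent = np.isclose(recomputed, df["square_meters_price"], rtol=1e-3)
print(bool(consistent.all()))  # → True

# Flag physically implausible listings (thresholds are assumptions)
suspect = df[(df["rooms"] < 1) | (df["bathroom"] < 1)]
print(len(suspect))  # → 1
```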
# Uniques
df.nunique() # Checking for number of variations in the data
Unnamed: 0             16376
price                    889
rooms                   1995
bathroom                1015
lift                       2
terrace                    2
square_meters           7751
real_state                 4
neighborhood              10
square_meters_price     9122
dtype: int64
df.columns
Index(['Unnamed: 0', 'price', 'rooms', 'bathroom', 'lift', 'terrace',
'square_meters', 'real_state', 'neighborhood', 'square_meters_price'],
dtype='object')
for i in ['rooms', 'bathroom', 'lift', 'terrace', 'real_state', 'neighborhood']: # Checking uniques
    print(i, ": ", df[i].unique())
rooms :  [3.    2.    1.    ... 4.131 4.195 2.899]
bathroom :  [1.    2.    3.    ... 5.898 2.862 2.866]
lift :  [ True False]
terrace :  [False  True]
real_state :  ['flat' 'attic' nan 'apartment' 'study']
neighborhood :  ['Horta- Guinardo' 'Sant Andreu' 'Gràcia' 'Ciutat Vella'
 'Sarria-Sant Gervasi' 'Les Corts' 'Sant Martí' 'Eixample' 'Sants-Montjuïc'
 'Nou Barris']
# Value counts for categorical and boolean columns
cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
for column in cat_cols:
print(df[column].value_counts())
print("-" * 50)
lift
True     11246
False     5130
Name: count, dtype: int64
--------------------------------------------------
terrace
False    12770
True      3606
Name: count, dtype: int64
--------------------------------------------------
real_state
flat         12650
apartment     1967
attic          633
study          208
Name: count, dtype: int64
--------------------------------------------------
neighborhood
Eixample               4795
Sarria-Sant Gervasi    2765
Ciutat Vella           2716
Gràcia                 1416
Sant Martí             1257
Sants-Montjuïc         1165
Les Corts              1045
Horta- Guinardo         638
Sant Andreu             368
Nou Barris              211
Name: count, dtype: int64
--------------------------------------------------
- There are four property types; the most common is "flat"
- Most units do not have a terrace
- Most units do have a lift
- The neighborhood with the largest unit count is "Eixample"
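These distributional notes can be quantified as category shares with `value_counts(normalize=True)`; a minimal sketch on an illustrative column (the 80/10/10 split is made up for the example):

```python
import pandas as pd

# Illustrative property-type column: 8 flats, 1 apartment, 1 attic
df = pd.DataFrame({"real_state": ["flat"] * 8 + ["apartment", "attic"]})

# normalize=True returns the share of each category rather than raw counts
shares = df["real_state"].value_counts(normalize=True)
print(shares.idxmax(), round(shares.max(), 2))  # → flat 0.8
```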
# Display all rows in pandas outputs
pd.set_option("display.max_rows", None) # Set to None to show all rows
# Print the value counts of 'rooms'
print(df['rooms'].value_counts())
print("-" * 50)
print(df['bathroom'].value_counts())
# Optionally reset display settings (if needed later in the script)
pd.reset_option("display.max_rows")
rooms
2.000     2608
3.000     2461
1.000     1600
4.000     1061
0.000      399
5.000      232
6.000       28
...
10.754       1
7.313        1
Name: count, dtype: int64
--------------------------------------------------
bathroom
1.000    4873
2.000    2742
3.000     421
4.000     121
5.000      41
...
Name: count, dtype: int64

(Output truncated for readability: the augmented data contains thousands of distinct fractional room and bathroom values, most occurring only a handful of times.)
2.866 1 3.041 1 3.187 1 4.903 1 3.061 1 3.228 1 3.232 1 2.976 1 2.959 1 2.904 1 3.703 1 3.224 1 5.908 1 3.139 1 3.081 1 2.829 1 4.679 1 3.863 1 2.824 1 4.252 1 4.914 1 4.325 1 4.255 1 3.052 1 3.275 1 4.502 1 2.914 1 3.146 1 3.054 1 3.976 1 3.855 1 4.208 1 3.040 1 4.374 1 2.913 1 2.886 1 2.991 1 4.841 1 3.027 1 4.746 1 3.010 1 2.852 1 4.900 1 2.912 1 4.295 1 3.025 1 3.627 1 3.297 1 3.719 1 3.294 1 3.219 1 4.158 1 3.021 1 2.885 1 2.916 1 3.803 1 3.251 1 4.233 1 2.884 1 2.863 1 1.924 1 3.051 1 2.814 1 3.769 1 3.231 1 4.203 1 2.940 1 3.140 1 1.807 1 3.107 1 5.857 1 4.745 1 4.370 1 4.132 1 4.340 1 5.717 1 3.885 1 5.393 1 3.178 1 3.175 1 4.929 1 2.861 1 4.855 1 3.072 1 3.152 1 4.119 1 4.854 1 3.705 1 2.773 1 2.977 1 3.133 1 3.290 1 2.799 1 3.640 1 3.601 1 2.781 1 4.326 1 3.287 1 2.816 1 3.058 1 2.771 1 2.975 1 3.242 1 5.490 1 3.642 1 4.112 1 4.353 1 2.011 1 5.384 1 5.263 1 2.758 1 2.809 1 3.267 1 2.905 1 3.008 1 3.937 1 3.070 1 2.709 1 3.125 1 2.888 1 4.689 1 3.695 1 2.037 1 3.254 1 3.213 1 3.129 1 3.263 1 4.701 1 2.718 1 3.221 1 4.044 1 3.293 1 1.800 1 2.911 1 3.189 1 4.285 1 2.821 1 2.736 1 4.139 1 5.251 1 2.840 1 5.759 1 4.754 1 2.755 1 3.230 1 3.206 1 2.754 1 3.202 1 2.871 1 3.239 1 2.778 1 4.140 1 3.177 1 4.198 1 4.055 1 2.733 1 3.113 1 3.073 1 5.366 1 4.323 1 3.952 1 2.936 1 3.094 1 5.474 1 3.104 1 5.063 1 3.159 1 4.668 1 2.958 1 2.880 1 2.930 1 2.815 1 3.281 1 3.216 1 4.789 1 2.903 1 2.808 1 3.276 1 2.925 1 2.812 1 2.853 1 4.028 1 4.266 1 3.015 1 3.277 1 3.218 1 3.034 1 5.431 1 2.973 1 3.176 1 2.856 1 3.050 1 2.788 1 2.947 1 3.222 1 2.854 1 4.659 1 3.257 1 4.964 1 5.322 1 2.990 1 3.077 1 3.273 1 3.265 1 3.236 1 3.200 1 3.111 1 4.343 1 4.279 1 2.867 1 4.025 1 4.826 1 3.209 1 2.732 1 3.698 1 5.195 1 4.333 1 3.155 1 4.050 1 4.011 1 4.695 1 4.116 1 3.128 1 4.312 1 3.291 1 3.959 1 2.847 1 3.296 1 2.830 1 5.218 1 2.719 1 4.141 1 3.185 1 2.780 1 2.762 1 3.150 1 3.199 1 2.819 1 2.804 1 2.921 1 2.887 1 4.899 1 2.803 1 3.836 1 2.761 1 3.829 1 7.160 1 2.994 1 4.773 1 3.614 1 
3.007 1 5.444 1 3.762 1 3.188 1 3.044 1 3.602 1 4.377 1 3.725 1 4.371 1 3.655 1 2.806 1 4.335 1 3.279 1 3.123 1 2.848 1 3.103 1 2.926 1 1.835 1 3.283 1 2.765 1 4.948 1 2.890 1 4.297 1 4.113 1 2.813 1 2.823 1 3.271 1 4.797 1 3.047 1 2.894 1 2.774 1 3.135 1 4.400 1 2.850 1 2.702 1 3.068 1 2.846 1 3.205 1 2.929 1 7.000 1 2.862 1 5.898 1 3.046 1 2.946 1 2.787 1 2.704 1 3.087 1 3.300 1 3.119 1 4.690 1 2.701 1 2.791 1 4.105 1 3.953 1 4.196 1 3.182 1 3.780 1 2.012 1 3.255 1 3.083 1 3.975 1 4.204 1 2.158 1 2.724 1 3.246 1 2.826 1 2.743 1 4.321 1 4.064 1 3.053 1 2.828 1 Name: count, dtype: int64
- the variable 'rooms' will require feature engineering
- the variable 'bathroom' will require feature engineering
room_counts_list = []
# Iterate through each integer value of rooms
for i in range(1, 1 + int(df['rooms'].max())): # pandas .max() skips NaN, unlike the built-in max()
count = df['rooms'][df['rooms'] == i].count() # Count occurrences for the current value
room_counts_list.append({'rooms': i, 'count': count}) # Add result to the list
# Convert the list of dictionaries into a DataFrame
room_counts = pd.DataFrame(room_counts_list)
#calculate totals
int_rooms=room_counts['count'].sum()
room_counts['int_prop']=room_counts['count']/int_rooms
room_counts['net_prop']=room_counts['count']/len(df) # proportion over all 16376 observations
room_counts
| | rooms | count | int_prop | net_prop |
|---|---|---|---|---|
| 0 | 1 | 1600 | 0.200000 | 0.097704 |
| 1 | 2 | 2608 | 0.326000 | 0.159257 |
| 2 | 3 | 2461 | 0.307625 | 0.150281 |
| 3 | 4 | 1061 | 0.132625 | 0.064790 |
| 4 | 5 | 232 | 0.029000 | 0.014167 |
| 5 | 6 | 28 | 0.003500 | 0.001710 |
| 6 | 7 | 5 | 0.000625 | 0.000305 |
| 7 | 8 | 0 | 0.000000 | 0.000000 |
| 8 | 9 | 2 | 0.000250 | 0.000122 |
| 9 | 10 | 3 | 0.000375 | 0.000183 |
print(f'The total number of observations with an integer number for variable "rooms" is {room_counts["count"].sum()}, this represents {room_counts["net_prop"].sum()*100:.2f}% of total observations')
The total number of observations with an integer number for variable "rooms" is 8000, this represents 48.85% of total observations
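The counting loop above can also be expressed as a single vectorized pass. A minimal sketch on a toy series (the values here are illustrative, not taken from the dataset):

```python
import pandas as pd

rooms = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 2.5, 4.1])  # toy data with non-integer noise

# Keep only whole-number entries, then count each integer value in one pass
int_rooms = rooms[rooms % 1 == 0].astype(int)
counts = int_rooms.value_counts().sort_index()
counts = counts.reindex(range(1, int(rooms.max()) + 1), fill_value=0)
print(counts.to_dict())
```

`reindex` with `fill_value=0` keeps integer values that never occur (here 4) in the table, mirroring the zero-count rows the loop produces.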
bathroom_counts_list = []
# Iterate through each integer value of bathroom
for i in range(1, 1 + int(df['bathroom'].max())): # pandas .max() skips NaN, unlike the built-in max()
count = df['bathroom'][df['bathroom'] == i].count() # Count occurrences for the current value
bathroom_counts_list.append({'bathroom': i, 'count': count}) # Add result to the list
# Convert the list of dictionaries into a DataFrame
bathroom_counts = pd.DataFrame(bathroom_counts_list)
#calculate totals
int_bathroom=bathroom_counts['count'].sum()
bathroom_counts['int_prop']=bathroom_counts['count']/int_bathroom
bathroom_counts['net_prop']=bathroom_counts['count']/len(df) # proportion over all 16376 observations
bathroom_counts
| | bathroom | count | int_prop | net_prop |
|---|---|---|---|---|
| 0 | 1 | 4873 | 0.593544 | 0.297570 |
| 1 | 2 | 2742 | 0.333983 | 0.167440 |
| 2 | 3 | 421 | 0.051279 | 0.025708 |
| 3 | 4 | 121 | 0.014738 | 0.007389 |
| 4 | 5 | 41 | 0.004994 | 0.002504 |
| 5 | 6 | 9 | 0.001096 | 0.000550 |
| 6 | 7 | 1 | 0.000122 | 0.000061 |
| 7 | 8 | 2 | 0.000244 | 0.000122 |
print(f'The total number of observations with an integer number for variable "bathroom" is {bathroom_counts["count"].sum()}, this represents {bathroom_counts["net_prop"].sum()*100:.2f}% of total observations')
The total number of observations with an integer number for variable "bathroom" is 8210, this represents 50.13% of total observations
- Given the high proportion of non-integer (hence invalid) values in 'rooms' and 'bathroom' (51.15% and 49.87% respectively), and since the Project 2 dataset is stated to be an augmented version of the Project 1 dataset, the interpretation is that the Project 1 data was enlarged with artificially generated observations, and that during this data augmentation the generated decimal values were never rounded back to integers.
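The same diagnosis can be reached directly by measuring the fraction of non-integer values per column. A minimal sketch with made-up toy values (NaN-safe via dropna):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "rooms":    [1.0, 2.0, 3.721, 4.508, np.nan],
    "bathroom": [1.0, 1.094, 2.0, 1.0, 0.986],
})

# Share of observed (non-null) values that are not whole numbers
for col in ["rooms", "bathroom"]:
    vals = df_demo[col].dropna()
    share = (vals % 1 != 0).mean()
    print(f"{col}: {share:.0%} non-integer")
```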
# Duplicates
print(df.duplicated().sum()) # Checking for duplicate entries in the data
0
- There are no duplicated observations
Missing Value handling¶
df2=df.copy()
null_counts = df2.isnull().sum()
null_percentage = (null_counts / len(df2)) * 100
null_summary = pd.DataFrame({'Null Count': null_counts,'Null Percentage': null_percentage.round(2)})
null_summary
| | Null Count | Null Percentage |
|---|---|---|
| Unnamed: 0 | 0 | 0.00 |
| price | 0 | 0.00 |
| rooms | 410 | 2.50 |
| bathroom | 387 | 2.36 |
| lift | 0 | 0.00 |
| terrace | 0 | 0.00 |
| square_meters | 408 | 2.49 |
| real_state | 918 | 5.61 |
| neighborhood | 0 | 0.00 |
| square_meters_price | 439 | 2.68 |
# Create a new dataframe with rows that contain at least one missing value
df_missing = df[df.isnull().any(axis=1)]
# Reset index for better readability (optional)
df_missing = df_missing.reset_index(drop=True)
df_missing.shape
(2311, 10)
df_missing.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2311 entries, 0 to 2310 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 2311 non-null int64 1 price 2311 non-null int64 2 rooms 1901 non-null float64 3 bathroom 1924 non-null float64 4 lift 2311 non-null bool 5 terrace 2311 non-null bool 6 square_meters 1903 non-null float64 7 real_state 1393 non-null object 8 neighborhood 2311 non-null object 9 square_meters_price 1872 non-null float64 dtypes: bool(2), float64(4), int64(2), object(2) memory usage: 149.1+ KB
mask1 = df2["square_meters"].isna() & df2["price"].notna() & df2["square_meters_price"].notna()
df2.loc[mask1, "square_meters"] = df2["price"] / df2["square_meters_price"]
df2.isnull().sum() # Checking for missing values in the data
Unnamed: 0 0 price 0 rooms 410 bathroom 387 lift 0 terrace 0 square_meters 19 real_state 918 neighborhood 0 square_meters_price 439 dtype: int64
- 389 of the 408 missing "square_meters" values are imputed using the relation "price" / "square_meters_price"
mask2 = df2["square_meters"].notna() & df2["price"].notna() & df2["square_meters_price"].isna()
df2.loc[mask2, "square_meters_price"] = df2["price"] / df2["square_meters"]
df2.isnull().sum() # Checking for missing values in the data
Unnamed: 0 0 price 0 rooms 410 bathroom 387 lift 0 terrace 0 square_meters 19 real_state 918 neighborhood 0 square_meters_price 19 dtype: int64
- 420 of the 439 missing "square_meters_price" values are imputed using the relation "price" / "square_meters"
df2[(df2['square_meters_price'].isnull())&(df2['square_meters'].isnull())]
| | Unnamed: 0 | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 8748 | 8748 | 1300 | 4.392 | 1.980 | True | False | NaN | flat | Eixample | NaN |
| 8784 | 8784 | 850 | 0.950 | 0.995 | False | False | NaN | flat | Sants-Montjuïc | NaN |
| 9118 | 9118 | 925 | 2.175 | 0.924 | True | False | NaN | flat | Gràcia | NaN |
| 9321 | 9321 | 895 | NaN | 1.877 | True | False | NaN | flat | Sants-Montjuïc | NaN |
| 9442 | 9442 | 800 | 0.924 | 1.092 | False | False | NaN | flat | Horta- Guinardo | NaN |
| 9519 | 9519 | 995 | 2.858 | 2.161 | False | True | NaN | flat | Eixample | NaN |
| 10167 | 10167 | 1218 | 3.686 | 2.140 | True | False | NaN | flat | Sarria-Sant Gervasi | NaN |
| 11180 | 11180 | 600 | NaN | 0.917 | True | True | NaN | flat | Sants-Montjuïc | NaN |
| 11496 | 11496 | 945 | 1.009 | 0.962 | True | True | NaN | attic | Sarria-Sant Gervasi | NaN |
| 11959 | 11959 | 850 | 3.039 | NaN | True | False | NaN | flat | Horta- Guinardo | NaN |
| 12782 | 12782 | 1000 | 3.226 | 0.965 | False | False | NaN | flat | Eixample | NaN |
| 13086 | 13086 | 790 | 3.637 | 0.963 | True | True | NaN | flat | Eixample | NaN |
| 13189 | 13189 | 5300 | NaN | 4.695 | False | True | NaN | flat | Les Corts | NaN |
| 13401 | 13401 | 850 | 0.973 | 0.917 | True | False | NaN | apartment | Eixample | NaN |
| 13817 | 13817 | 1100 | 2.937 | 2.169 | False | False | NaN | flat | Gràcia | NaN |
| 14693 | 14693 | 740 | 1.068 | 0.949 | True | False | NaN | flat | Eixample | NaN |
| 15761 | 15761 | 1140 | 2.179 | 0.942 | True | False | NaN | attic | Gràcia | NaN |
| 16118 | 16118 | 1500 | 3.255 | 1.986 | True | False | NaN | flat | Sarria-Sant Gervasi | NaN |
| 16181 | 16181 | 800 | 1.050 | 1.082 | True | False | NaN | flat | Gràcia | NaN |
- There are 19 properties missing values on both "square_meters" and "square_meters_price"
df2['square_meters_price'] = df2['square_meters_price'].fillna(df2.groupby(['real_state', 'neighborhood'])['square_meters_price'].transform('mean'))
df2.isnull().sum() # Checking for missing values in the data
Unnamed: 0 0 price 0 rooms 410 bathroom 387 lift 0 terrace 0 square_meters 19 real_state 918 neighborhood 0 square_meters_price 0 dtype: int64
- The 19 remaining missing "square_meters_price" values are imputed with the group mean over "real_state" and "neighborhood".
df2.loc[df2['square_meters'].isna(), 'square_meters'] = df2['price'] / df2['square_meters_price']
df2.isnull().sum() # Checking for missing values in the data
Unnamed: 0 0 price 0 rooms 410 bathroom 387 lift 0 terrace 0 square_meters 0 real_state 918 neighborhood 0 square_meters_price 0 dtype: int64
- The 19 remaining missing "square_meters" values are imputed using the relation "price" / "square_meters_price"
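The imputation cascade for the area columns (identity-based fills in both directions, then the group-mean fallback) can be exercised end to end on a toy frame. Column names mirror the notebook's; the values are made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price":               [900.0, 1200.0, 800.0, 1000.0],
    "square_meters":       [60.0,  np.nan, 40.0,  np.nan],
    "square_meters_price": [np.nan, 20.0,  20.0,  np.nan],
    "real_state":          ["flat", "flat", "flat", "flat"],
    "neighborhood":        ["A",    "A",    "A",    "A"],
})

# Step 1: fill each factor from the identity price = square_meters * square_meters_price
m = toy["square_meters"].isna() & toy["square_meters_price"].notna()
toy.loc[m, "square_meters"] = toy["price"] / toy["square_meters_price"]
m = toy["square_meters_price"].isna() & toy["square_meters"].notna()
toy.loc[m, "square_meters_price"] = toy["price"] / toy["square_meters"]

# Step 2: rows missing both fall back to the (real_state, neighborhood) group mean
grp = toy.groupby(["real_state", "neighborhood"])["square_meters_price"].transform("mean")
toy["square_meters_price"] = toy["square_meters_price"].fillna(grp)
toy.loc[toy["square_meters"].isna(), "square_meters"] = toy["price"] / toy["square_meters_price"]

print(toy[["square_meters", "square_meters_price"]])
```

The last row is missing both factors, so it can only be recovered via the group mean; the other gaps are filled exactly by the identity.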
# Compute the most common (mode) real_state for each neighborhood
mode_real_state = df2.groupby("neighborhood")["real_state"].apply(lambda x: x.mode()[0] if not x.mode().empty else np.nan)
# Fill missing values in real_state based on the mode of each neighborhood
df2["real_state"] = df2["real_state"].fillna(df2["neighborhood"].map(mode_real_state))
df2.isnull().sum() # Checking for missing values in the data
Unnamed: 0 0 price 0 rooms 410 bathroom 387 lift 0 terrace 0 square_meters 0 real_state 0 neighborhood 0 square_meters_price 0 dtype: int64
- Imputed missing "real_state" values by filling them with the most common (mode) "real_state" for each "neighborhood".
#df2['rooms'] = df2['rooms'].fillna(df2.groupby(['real_state', 'neighborhood'])['rooms'].transform('mean'))
df2['rooms'] = df2['rooms'].fillna(df2.groupby(['real_state', 'neighborhood'])['rooms'].transform('median'))
df2.isnull().sum() # Checking for missing values in the data
Unnamed: 0 0 price 0 rooms 0 bathroom 387 lift 0 terrace 0 square_meters 0 real_state 0 neighborhood 0 square_meters_price 0 dtype: int64
- The 410 missing "rooms" values are imputed with the group median over "real_state" and "neighborhood".
#df2['bathroom'] = df2['bathroom'].fillna(df2.groupby(['real_state', 'neighborhood'])['bathroom'].transform('mean'))
df2['bathroom'] = df2['bathroom'].fillna(df2.groupby(['real_state', 'neighborhood'])['bathroom'].transform('median'))
df2.isnull().sum() # Checking for missing values in the data
Unnamed: 0 0 price 0 rooms 0 bathroom 0 lift 0 terrace 0 square_meters 0 real_state 0 neighborhood 0 square_meters_price 0 dtype: int64
- The 387 missing "bathroom" values are imputed with the group median over "real_state" and "neighborhood".
Feature engineering¶
df3=df2.copy()
df3=df3.drop(['Unnamed: 0'],axis=1)
df3.head()
| | price | rooms | bathroom | lift | terrace | square_meters | real_state | neighborhood | square_meters_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 750 | 3.0 | 1.0 | True | False | 60.0 | flat | Horta- Guinardo | 12.500000 |
| 1 | 770 | 2.0 | 1.0 | True | False | 59.0 | flat | Sant Andreu | 13.050847 |
| 2 | 1300 | 1.0 | 1.0 | True | True | 30.0 | flat | Gràcia | 43.333333 |
| 3 | 2800 | 1.0 | 1.0 | True | True | 70.0 | flat | Ciutat Vella | 40.000000 |
| 4 | 720 | 2.0 | 1.0 | True | False | 44.0 | flat | Sant Andreu | 16.363636 |
- Removed the variable "Unnamed: 0" which had no value for modeling
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 16376 entries, 0 to 16375 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 16376 non-null int64 1 rooms 16376 non-null float64 2 bathroom 16376 non-null float64 3 lift 16376 non-null bool 4 terrace 16376 non-null bool 5 square_meters 16376 non-null float64 6 real_state 16376 non-null object 7 neighborhood 16376 non-null object 8 square_meters_price 16376 non-null float64 dtypes: bool(2), float64(4), int64(1), object(2) memory usage: 927.7+ KB
# Keep a snapshot of the data before rounding ('rooms' and 'bathroom' are still float64 here)
df3_float = df3.copy()
df3_float.shape
(16376, 9)
df3['rooms'] = df3['rooms'].apply(lambda x: 1 if x < 1 else round(x)).astype(int)
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 16376 entries, 0 to 16375 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 16376 non-null int64 1 rooms 16376 non-null int64 2 bathroom 16376 non-null float64 3 lift 16376 non-null bool 4 terrace 16376 non-null bool 5 square_meters 16376 non-null float64 6 real_state 16376 non-null object 7 neighborhood 16376 non-null object 8 square_meters_price 16376 non-null float64 dtypes: bool(2), float64(3), int64(2), object(2) memory usage: 927.7+ KB
df3['bathroom'] = df3['bathroom'].apply(lambda x: 1 if x < 1 else round(x)).astype(int)
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 16376 entries, 0 to 16375 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 16376 non-null int64 1 rooms 16376 non-null int64 2 bathroom 16376 non-null int64 3 lift 16376 non-null bool 4 terrace 16376 non-null bool 5 square_meters 16376 non-null float64 6 real_state 16376 non-null object 7 neighborhood 16376 non-null object 8 square_meters_price 16376 non-null float64 dtypes: bool(2), float64(2), int64(3), object(2) memory usage: 927.7+ KB
- Transformed the values of "rooms" and "bathroom" into an integer using the following logic:
- Values under 1 → Set to 1
- Values 1 or above → Round to the nearest integer
- Variables "rooms" and "bathroom" set as integer
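The rounding rule above can be isolated into a small helper (the helper name is ours, not from the notebook). Note that Python's built-in round uses round-half-to-even ("banker's rounding"), so exact halves go to the even neighbour — a caveat worth keeping in mind:

```python
def to_integer_count(x: float) -> int:
    # Rule used for 'rooms' and 'bathroom': below 1 -> set to 1, otherwise nearest integer
    return 1 if x < 1 else round(x)

print(to_integer_count(0.3), to_integer_count(2.4), to_integer_count(3.6))
# Caveat: round() is round-half-to-even, so round(2.5) == 2 while round(3.5) == 4
```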
df3.describe(include="all").T # statistical summary of the data.
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| price | 16376.0 | NaN | NaN | NaN | 1437.04586 | 1106.831419 | 320.0 | 875.0 | 1100.0 | 1514.0 | 15000.0 |
| rooms | 16376.0 | NaN | NaN | NaN | 2.447545 | 1.078844 | 1.0 | 2.0 | 2.0 | 3.0 | 11.0 |
| bathroom | 16376.0 | NaN | NaN | NaN | 1.495237 | 0.714843 | 1.0 | 1.0 | 1.0 | 2.0 | 8.0 |
| lift | 16376 | 2 | True | 11246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| terrace | 16376 | 2 | False | 12770 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| square_meters | 16376.0 | NaN | NaN | NaN | 84.357363 | 47.454864 | 10.0 | 56.048 | 72.689 | 95.0 | 679.0 |
| real_state | 16376 | 4 | flat | 13568 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| neighborhood | 16376 | 10 | Eixample | 4795 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| square_meters_price | 16376.0 | NaN | NaN | NaN | 17.727253 | 9.185362 | 4.549 | 12.773723 | 15.315158 | 19.389167 | 197.272 |
Outliers detection and treatment¶
# function to check for outliers
def count_outliers(df):
    outlier_count = 0
    for column in df.select_dtypes(include=np.number).columns:
        # IQR rule: flag values beyond 1.5*IQR from the quartiles
        q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
        iqr = q3 - q1
        outliers = int(((df[column] < q1 - 1.5*iqr) | (df[column] > q3 + 1.5*iqr)).sum())
        print(f'{column}: {outliers} outliers ({outliers/df.shape[0]*100:.2f}%)')
        outlier_count += outliers
    return outlier_count
df4=df3.copy()
count_outliers(df)
Unnamed: 0: 0 outliers (0.00%) price: 1778 outliers (10.86%) rooms: 870 outliers (5.31%) bathroom: 308 outliers (1.88%) square_meters: 1177 outliers (7.19%) square_meters_price: 1165 outliers (7.11%)
5298
df.shape
(16376, 10)
count_outliers(df4)
price: 1778 outliers (10.86%) rooms: 505 outliers (3.08%) bathroom: 308 outliers (1.88%) square_meters: 1206 outliers (7.36%) square_meters_price: 1201 outliers (7.33%)
4998
df4.shape
(16376, 9)
# Z-Score Method
df5=df4[(np.abs(df4.select_dtypes(include=np.number).apply(zscore))<2).all(axis=1)] # keep rows within 2 standard deviations on every numeric column
count_outliers(df5)
price: 960 outliers (6.73%) rooms: 0 outliers (0.00%) bathroom: 0 outliers (0.00%) square_meters: 499 outliers (3.50%) square_meters_price: 593 outliers (4.16%)
2052
df5.shape
(14269, 9)
- Applied the Z-score method, removing rows where any numeric value lies more than 2 standard deviations from its column mean.
- Some variables still retain a relevant percentage of outliers; df5 shape: (14269, 9)
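As a toy illustration of the 2-standard-deviation z-score filter (scipy's zscore uses the population standard deviation by default; the numbers here are made up):

```python
import numpy as np
from scipy.stats import zscore

x = np.array([10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 100.0])
z = zscore(x)         # (x - mean) / population std
keep = np.abs(z) < 2  # same rule as the df5 filter above
print(x[keep])
```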
df6=df5.copy()
for column in df6.select_dtypes(include=np.number).columns:
    q1, q3 = df6[column].quantile(0.25), df6[column].quantile(0.75)
    iqr = q3 - q1
    df6[column] = np.clip(df6[column], q1 - 1.5*iqr, q3 + 1.5*iqr) # cap at the IQR whiskers
df6.shape
(14269, 9)
count_outliers(df6)
price: 0 outliers (0.00%) rooms: 0 outliers (0.00%) bathroom: 0 outliers (0.00%) square_meters: 0 outliers (0.00%) square_meters_price: 0 outliers (0.00%)
0
- Capping outliers at the whiskers (winsorization) is chosen given the nature of the data
- Winsorization can hide valuable trends in luxury or budget properties, but here the extreme prices are assumed to be errors or anomalies introduced by the synthetic/augmented data, so winsorizing makes the model more robust to those outliers.
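A toy illustration of clipping at the IQR whiskers, using the same fences as the loop above (the values are made up):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])   # 100 is an extreme value
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
clipped = s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # cap at the whiskers instead of dropping
print(clipped.tolist())
```

Unlike row removal, the sample size is preserved: only the extreme value is pulled in to the upper fence.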
df6.info()
<class 'pandas.core.frame.DataFrame'> Index: 14269 entries, 0 to 16375 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 14269 non-null int64 1 rooms 14269 non-null int64 2 bathroom 14269 non-null int64 3 lift 14269 non-null bool 4 terrace 14269 non-null bool 5 square_meters 14269 non-null float64 6 real_state 14269 non-null object 7 neighborhood 14269 non-null object 8 square_meters_price 14269 non-null float64 dtypes: bool(2), float64(2), int64(3), object(2) memory usage: 919.7+ KB
df6.describe(include="all").T # statistical summary of the data.
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| price | 14269.0 | NaN | NaN | NaN | 1124.019833 | 371.448532 | 320.0 | 850.0 | 1000.0 | 1300.0 | 1975.0 |
| rooms | 14269.0 | NaN | NaN | NaN | 2.308291 | 0.939464 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| bathroom | 14269.0 | NaN | NaN | NaN | 1.340949 | 0.474045 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| lift | 14269 | 2 | True | 9753 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| terrace | 14269 | 2 | False | 11384 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| square_meters | 14269.0 | NaN | NaN | NaN | 73.41199 | 25.366439 | 10.313 | 55.0 | 70.0 | 87.264 | 135.66 |
| real_state | 14269 | 4 | flat | 12201 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| neighborhood | 14269 | 10 | Eixample | 4154 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| square_meters_price | 14269.0 | NaN | NaN | NaN | 16.1397 | 4.684702 | 6.001 | 12.676 | 15.0 | 18.681319 | 27.689297 |
Data Management¶
df.to_csv('df_ORIGINAL_DATA.csv', index=False) # Save a copy of original data
df_missing.to_csv('df_MISSING_DATA.csv', index=False) # Save a copy of missing data to be imputed
df2.to_csv('df2_IMPUTED_DATA.csv', index=False) # Save a copy of data after imputation of missing values
df3_float.to_csv('df3_WRONG FEATURES_DATA.csv', index=False) # Save a copy of data before feature engineering
df3.to_csv('df3_FEATURE ENGINEERED_DATA.csv', index=False) # Save a copy of data after feature engineering
df6.to_csv('df6_WITHOUT OUTLIERS_DATA.csv', index=False) # Save a copy of data after outliers handling
- 'df_ORIGINAL_DATA.csv': Reference dataset as a copy of original data.
- 'df_MISSING_DATA.csv': Data subset filtered by missing value data.
- 'df2_IMPUTED_DATA.csv': Updated dataset after the imputation of the missing values.
- 'df3_WRONG FEATURES_DATA.csv': Data subset filtered by data subject to feature engineering.
- 'df3_FEATURE ENGINEERED_DATA.csv': Updated dataset after feature engineering.
- 'df6_WITHOUT OUTLIERS_DATA.csv': Updated dataset after handling outliers
Data Preparation Consolidated Notes¶
Data Overview
- The variable 'Unnamed: 0' represents the index and should be deleted from the data
- Target variable for modeling is "price"
- There are 16376 rows and 10 columns.
- Project1 data had 8188 rows and 10 columns.
- Data types are consistent with the data description, except 'rooms' and 'bathroom', which are float but expected to be integer
- There are missing data (NaN) on multiple variables
- Unit sizes range from 10m2 to 679m2, with a mean of 84.36m2
- Unit prices range from 320EUR to 15000EUR/month, with a mean of 1437EUR/month
- The price range is assumed to refer to monthly rent, so it is treated as EUR per month
- Unit prices per square meter range from 4.549EUR/m2/month to 197.272EUR/m2/month, with a mean of 17.73EUR/m2/month
- There are units listed with zero rooms and with 10.754 rooms
- There are units listed with 0.9 bathrooms
- There are four types of real estate, the most common being "flat"
- Most units do not have a terrace
- Most units do have a lift
- The neighborhood with the largest unit count is "Eixample"
- The variable 'rooms' will require feature engineering
- The variable 'bathroom' will require feature engineering
- The total number of observations with an integer number for variable "rooms" is 8000, this represents 48.85% of total observations
- The total number of observations with an integer number for variable "bathroom" is 8210, this represents 50.13% of total observations
- Given the high proportion of non-integer (hence invalid) values in 'rooms' and 'bathroom' (51.15% and 49.87% respectively), and since the Project 2 dataset is stated to be an augmented version of the Project 1 dataset, the interpretation is that the Project 1 data was enlarged with artificially generated observations, and that during this data augmentation the generated decimal values were never rounded back to integers.
- There are no duplicated observations
Missing Value handling
- 389 of the 408 missing "square_meters" values are imputed using the relation "price" / "square_meters_price"
- 420 of the 439 missing "square_meters_price" values are imputed using the relation "price" / "square_meters"
- There are 19 properties missing values on both "square_meters" and "square_meters_price"
- The 19 remaining missing "square_meters_price" values are imputed with the group mean over "real_state" and "neighborhood".
- The 19 remaining missing "square_meters" values are imputed using the relation "price" / "square_meters_price"
- Imputed missing "real_state" values by filling them with the most common (mode) "real_state" for each "neighborhood".
- The 410 missing "rooms" values are imputed with the group median over "real_state" and "neighborhood".
- The 387 missing "bathroom" values are imputed with the group median over "real_state" and "neighborhood".
Feature engineering
- Removed the variable "Unnamed: 0" which had no value for modeling
- Transformed the values of "rooms" and "bathroom" into an integer using the following logic:
- Values under 1 → Set to 1
- Values 1 or above → Round to the nearest integer
- Variables "rooms" and "bathroom" set as integer
Outliers detection and treatment
- Applied the Z-score method, removing rows where any numeric value lies more than 2 standard deviations from its column mean.
- Some variables still retain a relevant percentage of outliers; df5 shape: (14269, 9)
- Capping outliers at the whiskers (winsorization) is chosen given the nature of the data
- Winsorization can hide valuable trends in luxury or budget properties, but here the extreme prices are assumed to be errors or anomalies introduced by the synthetic/augmented data, so winsorizing makes the model more robust to those outliers.
Data Management
- 'df_ORIGINAL_DATA.csv': Reference dataset as a copy of original data.
- 'df_MISSING_DATA.csv': Data subset filtered by missing value data.
- 'df2_IMPUTED_DATA.csv': Updated dataset after the imputation of the missing values.
- 'df3_WRONG FEATURES_DATA.csv': Data subset filtered by data subject to feature engineering.
- 'df3_FEATURE ENGINEERED_DATA.csv': Updated dataset after feature engineering.
- 'df6_WITHOUT OUTLIERS_DATA.csv': Updated dataset after handling outliers
4. Exploratory Data Analysis (EDA)¶
Analyzing the data to understand patterns, relationships, and potential anomalies. This step often involves data visualization and statistical analysis to generate insights.
EDA Functions¶
def univariate_numerical(data):
'''
Function to generate two plots for each numerical variable
Histplot for variable distribution
Boxplot for statistical summary
'''
# Select numerical columns
numerical_cols = data.select_dtypes(include=[np.number]).columns
# Determine the number of rows and columns
num_vars = len(numerical_cols)
num_cols = 4
num_rows = int(np.ceil(num_vars * 2 / num_cols))
# Create a figure with the specified size
fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5))
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Plot each variable with a histplot and a boxplot
for i, col in enumerate(numerical_cols):
mean_value = data[col].mean()
# Histplot with KDE
sns.histplot(data[col], kde=True, ax=axes[i*2])
axes[i*2].axvline(mean_value, color='r', linestyle='--')
axes[i*2].set_title(f'Distribution of {col}')
axes[i*2].text(mean_value, axes[i*2].get_ylim()[1]*0.8, f'Mean: {mean_value:.2f}', color='r', va='baseline', ha='left',rotation=90)
# Boxplot
sns.boxplot(y=data[col], ax=axes[i*2 + 1])
axes[i*2 + 1].axhline(mean_value, color='r', linestyle='--')
axes[i*2 + 1].set_title(f'Boxplot of {col}')
axes[i*2 + 1].text(axes[i*2 + 1].get_xlim()[1]*0.8, mean_value, f'mean: {mean_value:.2f}', color='r', va='baseline', ha='right')
# Hide any remaining empty subplots
for j in range(num_vars * 2, len(axes)):
fig.delaxes(axes[j])
# Adjust layout
plt.tight_layout()
plt.show()
def univariate_categorical(data):
'''
Function to generate countplot for each categorical variable
Labeled with count and percentage
'''
# List of categorical columns
categorical_columns = data.select_dtypes(include=['object', 'category']).columns.tolist()
# Number of columns in the grid
num_cols = 4
# Calculate the number of rows needed
num_rows = (len(categorical_columns) + num_cols - 1) // num_cols
# Create the grid
fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
axes = axes.flatten()
# Plot each countplot in the grid
for i, col in enumerate(categorical_columns):
ax = axes[i]
plot = sns.countplot(x=col, data=data, order=data[col].value_counts().index, ax=ax)
ax.set_title(f'Count of {col}')
# Add total count and percentage annotations
total = len(data)
for p in plot.patches:
height = p.get_height()
percentage = f'{(height / total * 100):.1f}%'
plot.text(x=p.get_x() + p.get_width() / 2,
y=height + 2,
s=f'{height:.0f}\n({percentage})',
ha='center')
# Limit x-axis labels to avoid overlap
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
# Remove any empty subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
# Show the plot
plt.show()
# Function to plot crosstab with labels
def plot_crosstab_bar_count(df, var_interest):
'''
Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables
Labeled with counts
'''
# Extract categorical columns excluding the variable of interest
cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
cat_cols.remove(var_interest)
# Determine the grid size
num_vars = len(cat_cols)
num_cols = 3 # Number of columns in the grid
num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)
# Create a grid of subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
axes = axes.flatten() # Flatten the axes array for easy iteration
for i, col in enumerate(cat_cols):
# Create a crosstab
crosstab = pd.crosstab(df[col], df[var_interest])
# Plot the crosstab as a bar plot
crosstab.plot(kind='bar', stacked=True, ax=axes[i])
# Annotate counts in the middle of each bar section
for bar in axes[i].patches:
height = bar.get_height()
if height > 0:
axes[i].annotate(f'{int(height)}',
(bar.get_x() + bar.get_width() / 2, bar.get_y() + height / 2),
ha='center', va='center', fontsize=10, color='black')
# Add total labels at the top of each bar
totals = crosstab.sum(axis=1)
for j, total in enumerate(totals):
axes[i].annotate(f'Total: {total}',
(j, totals[j]),
ha='center', va='bottom', weight='bold')
# Hide any remaining empty subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
# Usage
#plot_crosstab_bar_count(df, var_interest='var_interest')
def plot_crosstab_heat_perc(df, var_interest,df_name="DataFrame"):
'''
    Function to create a heatmap of the crosstab of the variable of interest vs each of the remaining categorical variables
Labeled with counts, percentage by row, percentage by column
'''
# Extract categorical columns excluding the variable of interest
cat_cols = df.select_dtypes(include=['category', 'object']).columns.tolist()
cat_cols.remove(var_interest)
# Determine the grid size
num_vars = len(cat_cols)
num_cols = 3 # Number of columns in the grid
num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)
# Create a grid of subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(6*num_cols, num_rows * 6))
axes = axes.flatten() # Flatten the axes array for easy iteration
for i, col in enumerate(cat_cols):
# Create crosstabs
crosstab = pd.crosstab(df[col], df[var_interest])
crosstab_perc_row = crosstab.div(crosstab.sum(axis=1), axis=0) * 100
crosstab_perc_col = crosstab.div(crosstab.sum(axis=0), axis=1) * 100
# Combine counts with percentages
crosstab_combined = crosstab.astype(str) + "\n" + \
crosstab_perc_row.round(2).astype(str) + "%" + "\n" + \
crosstab_perc_col.round(2).astype(str) + "%"
# Plot the crosstab as a heatmap
sns.heatmap(crosstab, annot=crosstab_combined, fmt='', cmap='Blues', ax=axes[i], cbar=False, annot_kws={"size": 8})
axes[i].set_title(f'Crosstab of {col} and {var_interest} - {df_name}', fontsize=12)
# Hide any remaining empty subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
# Adjust layout to prevent label overlapping
plt.subplots_adjust(hspace=0.4, wspace=0.4) # Add more space between subplots
plt.tight_layout()
plt.show()
# Usage
#plot_crosstab_heat_perc(df, var_interest='var_interest')
def boxplot_by_group(df, group, var, outliers, df_name="DataFrame"):
'''
boxplot for a numerical variable of interest vs a categorical variable
with or without outliers
includes data mean and mean by category
'''
# Calculate the average for the variable
var_avg = df[var].mean()
# Calculate variable mean per group
var_means = df.groupby(group)[var].mean()
# Sort by means and get the sorted order
var_sorted = var_means.sort_values(ascending=False).index
# Reorder the DataFrame by the sorted group
df[group] = pd.Categorical(df[group], categories=var_sorted, ordered=True)
# Create the boxplot with the reordered sectors
ax = sns.boxplot(data=df, x=group, y=var, order=var_sorted, showfliers=outliers)
# Add horizontal line for average variable value
plt.axhline(var_avg, color='red', linestyle='--', label=f'Avg {var}: {var_avg:.2f}')
# Scatter plot for means
x_positions = range(len(var_means.sort_values(ascending=False)))
plt.scatter(x=x_positions, y=var_means.sort_values(ascending=False), color='red', label='Mean', zorder=5)
# Add labels to each red dot with the mean value
for i, mean in enumerate(var_means.sort_values(ascending=False)):
plt.text(i, mean, f'{mean:.2f}', color='red', ha='center', va='bottom')
# Rotate x-axis labels
plt.xticks(ticks=x_positions, labels=var_means.sort_values(ascending=False).index, rotation=90)
# Add a legend
plt.legend()
plt.xlabel('') # Remove x-axis title
# Add plot title with DataFrame name
plt.title(f'Boxplot of {var} by {group} - {df_name}')
# Adjust layout
plt.tight_layout()
# Display the plot
#plt.show()
# Get the top 3 categories
top_3_categories = var_means.sort_values(ascending=False).head(3).index.tolist()
top_3=",".join(top_3_categories)
# Print the top 3 categories
print(f'Top 3 {group} by {var} mean value are: {top_3}')
# Define the function to create and display side-by-side boxplots
def side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2):
fig, axes = plt.subplots(1, 2, figsize=(18, 6), sharey=True)
# First subplot for df1
plt.sca(axes[0])
boxplot_by_group(df1, group, var, outliers, title1)
# Second subplot for df2
plt.sca(axes[1])
boxplot_by_group(df2, group, var, outliers, title2)
# Show both plots after setup
plt.show()
# Usage
#side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', True, "All units (show outliers)", "Popular units (show outliers)")
Functions
- univariate_numerical(data): Function to generate two plots for each numerical variable. Histplot for variable distribution. Boxplot for statistical summary
- univariate_categorical(data): Function to generate countplot for each categorical variable. Labeled with count and percentage
- plot_crosstab_bar_count(df, var_interest): Function to create a barplot of the crosstab of the variable of interest vs each of the remaining categorical variables. Labeled with counts
- plot_crosstab_heat_perc(df, var_interest): Function to create a heatmap of the crosstab of the variable of interest vs each of the remaining categorical variables. Labeled with counts, percentage by row, percentage by column
- boxplot_by_group(df, group, var, outliers): Boxplot of a numerical variable of interest by a categorical variable, with or without outliers. Includes the overall mean and the mean by category
- side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2): Presents two boxplot_by_group plots side by side
Univariate Analysis¶
univariate_numerical(df)
univariate_numerical(df6)
- 'price', 'square_meters' and 'square_meters_price' are right skewed and reflect the effect of capping outliers at the upper whisker.
- Comparing the original data (df) with the prepared data (df6), it is noticeable that in the original data the numerical variables have float values and many outliers, while in the prepared data the count variables take integer values and the outliers have been removed.
univariate_categorical(df6)
df6.loc[(df6['real_state']=="flat")].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| price | 12201.0 | 1097.272601 | 344.695856 | 320.000 | 850.000 | 1000.000 | 1250.000000 | 1975.000000 |
| rooms | 12201.0 | 2.384067 | 0.930198 | 1.000 | 2.000 | 2.000 | 3.000000 | 4.000000 |
| bathroom | 12201.0 | 1.353004 | 0.477923 | 1.000 | 1.000 | 1.000 | 2.000000 | 2.000000 |
| square_meters | 12201.0 | 74.623287 | 24.862834 | 13.181 | 56.559 | 70.902 | 88.388000 | 135.660000 |
| square_meters_price | 12201.0 | 15.435230 | 4.183763 | 6.001 | 12.465 | 14.516 | 17.647059 | 27.689297 |
df.loc[(df['real_state']=="flat")].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 12650.0 | 8038.174625 | 4750.342568 | 0.000000 | 3887.250000 | 7965.5000 | 12162.75000 | 16375.000 |
| price | 12650.0 | 1311.412490 | 917.152962 | 320.000000 | 865.000000 | 1050.0000 | 1352.00000 | 15000.000 |
| rooms | 12351.0 | 2.551887 | 1.091363 | 0.000000 | 2.000000 | 2.7380 | 3.03600 | 10.754 |
| bathroom | 12380.0 | 1.509471 | 0.715738 | 0.900000 | 1.000000 | 1.0405 | 2.00000 | 8.000 |
| square_meters | 12352.0 | 85.484011 | 45.657731 | 10.540000 | 59.000000 | 74.7985 | 95.64450 | 679.000 |
| square_meters_price | 12322.0 | 15.707694 | 5.333934 | 5.555556 | 12.437625 | 14.5000 | 17.67775 | 103.176 |
- In the prepared data, flat units have at most 4 rooms and 135 m² of area.
- In the original data there are flat units with up to 10.754 rooms and 679 m² of area.
- These "large flat" units are assumed to be unreal/invalid data and are handled during Data Preparation.
df6.loc[(df6['neighborhood']=="Eixample")].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| price | 4154.0 | 1186.479779 | 376.948534 | 425.000 | 900.00000 | 1100.0000 | 1400.0000 | 1975.000000 |
| rooms | 4154.0 | 2.403707 | 0.940306 | 1.000 | 2.00000 | 2.0000 | 3.0000 | 4.000000 |
| bathroom | 4154.0 | 1.391911 | 0.488236 | 1.000 | 1.00000 | 1.0000 | 2.0000 | 2.000000 |
| square_meters | 4154.0 | 76.669525 | 25.506765 | 16.197 | 58.00000 | 74.0505 | 90.2625 | 135.660000 |
| square_meters_price | 4154.0 | 16.357695 | 4.795151 | 6.074 | 12.79825 | 15.0455 | 19.1155 | 27.689297 |
- The categorical variables are not balanced: 85.5% of properties are "flats" and 78.5% of units are concentrated in 50% of the sample neighborhoods.
- 75% of flat units have up to 3 bedrooms and up to 2 bathrooms, with an average size of 85.48 m².
- 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms, with an average size of 80.21 m².
Bivariate Analysis¶
# Create a PairGrid
g = sns.PairGrid(df6, corner=True)
# Map different plots to the grid
g.map_lower(sns.scatterplot)
g.map_diag(sns.histplot,kde=True)
# Show the plot
plt.show()
# Calculate correlation matrix
corr_matrix = df6.select_dtypes(include=np.number).corr()
# Plot correlation matrix as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
# Display the sorted correlation table
corr_unstacked = corr_matrix.unstack() # Unstack the correlation matrix
corr_unstacked = corr_unstacked.reset_index() # Reset the index to get 'variable1' and 'variable2' as columns
corr_unstacked.columns = ['variable1', 'variable2', 'correlation']# Rename the columns for better understanding
corr_unstacked = corr_unstacked[corr_unstacked['variable1'] != corr_unstacked['variable2']] # Remove self-correlations by filtering out rows where variable1 == variable2
corr_unstacked = corr_unstacked.drop_duplicates(subset=['correlation']) # Drop duplicates to keep only one entry per variable pair
sorted_corr = corr_unstacked.sort_values(by='correlation', key=abs, ascending=False) # Sort the DataFrame by the absolute value of correlation
#sorted_corr # Display the sorted correlation table
# Define a function to categorize the correlation level
def categorize_correlation(correlation):
abs_corr = abs(correlation) * 100 # Convert to percentage for easier comparison
if abs_corr < 30:
return 'Negligible'
elif 30 <= abs_corr < 50:
return 'Low'
elif 50 <= abs_corr < 70:
return 'Moderate'
elif 70 <= abs_corr < 90:
return 'High'
else:
return 'Very High'
# Apply the function to create the corr_lvl column
sorted_corr['corr_lvl'] = sorted_corr['correlation'].apply(categorize_correlation)
sorted_corr['corr_lvl'].value_counts()
corr_lvl
Low           5
Moderate      4
Negligible    1
Name: count, dtype: int64
sorted_corr
| variable1 | variable2 | correlation | corr_lvl | |
|---|---|---|---|---|
| 3 | price | square_meters | 0.651214 | Moderate |
| 8 | rooms | square_meters | 0.635150 | Moderate |
| 13 | bathroom | square_meters | 0.608319 | Moderate |
| 2 | price | bathroom | 0.503768 | Moderate |
| 7 | rooms | bathroom | 0.451065 | Low |
| 9 | rooms | square_meters_price | -0.416305 | Low |
| 19 | square_meters | square_meters_price | -0.391874 | Low |
| 4 | price | square_meters_price | 0.381253 | Low |
| 1 | price | rooms | 0.304009 | Low |
| 14 | bathroom | square_meters_price | -0.111716 | Negligible |
- No pair of variables shows a high correlation (|r| ≥ 0.70); the strongest is price vs. square_meters at 0.65 (moderate)
boxplot_by_group(df6, 'neighborhood', 'price', False, df_name="(prepared data)")
Top 3 neighborhood by price mean value are: Sarria-Sant Gervasi,Eixample,Les Corts
boxplot_by_group(df6, 'neighborhood', 'square_meters', False, df_name="(prepared data)")
Top 3 neighborhood by square_meters mean value are: Eixample,Sarria-Sant Gervasi,Les Corts
boxplot_by_group(df6, 'neighborhood', 'square_meters_price', False, df_name="(prepared data)")
Top 3 neighborhood by square_meters_price mean value are: Ciutat Vella,Sarria-Sant Gervasi,Eixample
boxplot_by_group(df6, 'real_state', 'price', False, df_name="(prepared data)")
Top 3 real_state by price mean value are: apartment,attic,flat
boxplot_by_group(df6, 'real_state', 'square_meters', False, df_name="(prepared data)")
Top 3 real_state by square_meters mean value are: flat,attic,apartment
boxplot_by_group(df6, 'real_state', 'square_meters_price', False, df_name="(prepared data)")
Top 3 real_state by square_meters_price mean value are: apartment,study,attic
- Top 3 neighborhood by price mean value are: Sarria-Sant Gervasi,Eixample,Les Corts
- Top 3 neighborhood by square_meters mean value are: Eixample,Sarria-Sant Gervasi,Les Corts
- Top 3 neighborhood by square_meters_price mean value are: Ciutat Vella,Sarria-Sant Gervasi,Eixample
- Top 3 real_state by price mean value are: apartment,attic,flat
- Top 3 real_state by square_meters mean value are: flat,attic,apartment
- Top 3 real_state by square_meters_price mean value are: apartment,study,attic
- From the price-per-square-meter perspective, the most attractive unit type in this data could be the flat, with an average area of 74.62 m² (just above the overall average of 73.41 m²) and a price per square meter of 15.44, below the average of 16.14
plot_crosstab_heat_perc(df6, var_interest='real_state',df_name="prepared data")
- With 3544 flats, Eixample is the most popular unit-type and neighborhood combination: 85.32% of the units in Eixample are flats, and 29.05% of all flats are located in Eixample.
- Across all neighborhoods, "flat" is the most common unit type, accounting for at least 85.32% of the units in each neighborhood
plot_crosstab_bar_count(df6, var_interest='lift')
- Most unit types have a lift; for flats, the proportion is 71%
plot_crosstab_bar_count(df6, var_interest='terrace')
- Units with a terrace, on the other hand, are rare; very few have one
Exploratory Data Analysis Consolidated Notes¶
Functions
- univariate_numerical(data): Function to generate two plots for each numerical variable. Histplot for variable distribution. Boxplot for statistical summary
- univariate_categorical(data): Function to generate countplot for each categorical variable. Labeled with count and percentage
- plot_crosstab_bar_count(df, var_interest): Function to create a barplot of the crosstab of the variable of interest vs each of the remaining categorical variables. Labeled with counts
- plot_crosstab_heat_perc(df, var_interest): Function to create a heatmap of the crosstab of the variable of interest vs each of the remaining categorical variables. Labeled with counts, percentage by row, percentage by column
- boxplot_by_group(df, group, var, outliers): Boxplot of a numerical variable of interest by a categorical variable, with or without outliers. Includes the overall mean and the mean by category
- side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2): Presents two boxplot_by_group plots side by side
Univariate Analysis
- 'price', 'square_meters' and 'square_meters_price' are right skewed and reflect the effect of capping outliers at the upper whisker.
- Comparing the original data (df) with the prepared data (df6), it is noticeable that in the original data the numerical variables have float values and many outliers, while in the prepared data the count variables take integer values and the outliers have been removed.
- In the prepared data, flat units have at most 4 rooms and 135 m² of area.
- In the original data there are flat units with up to 10.754 rooms and 679 m² of area.
- These "large flat" units are assumed to be unreal/invalid data and are handled during Data Preparation.
- The categorical variables are not balanced: 85.5% of properties are "flats" and 78.5% of units are concentrated in 50% of the sample neighborhoods.
- 75% of flat units have up to 3 bedrooms and up to 2 bathrooms, with an average size of 85.48 m².
- 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms, with an average size of 80.21 m².
Bivariate Analysis
- No pair of variables shows a high correlation (|r| ≥ 0.70); the strongest is price vs. square_meters at 0.65 (moderate)
- Top 3 neighborhood by price mean value are: Sarria-Sant Gervasi,Eixample,Les Corts
- Top 3 neighborhood by square_meters mean value are: Eixample,Sarria-Sant Gervasi,Les Corts
- Top 3 neighborhood by square_meters_price mean value are: Ciutat Vella,Sarria-Sant Gervasi,Eixample
- Top 3 real_state by price mean value are: apartment,attic,flat
- Top 3 real_state by square_meters mean value are: flat,attic,apartment
- Top 3 real_state by square_meters_price mean value are: apartment,study,attic
- From the price-per-square-meter perspective, the most attractive unit type in this data could be the flat, with an average area of 74.62 m² (just above the overall average of 73.41 m²) and a price per square meter of 15.44, below the average of 16.14
- With 3544 flats, Eixample is the most popular unit-type and neighborhood combination: 85.32% of the units in Eixample are flats, and 29.05% of all flats are located in Eixample.
- Across all neighborhoods, "flat" is the most common unit type, accounting for at least 85.32% of the units in each neighborhood
- Most unit types have a lift; for flats, the proportion is 71%
- Units with a terrace, on the other hand, are rare; very few have one
5. Modeling¶
Selecting and applying appropriate machine learning or statistical models. This step includes training, validating, and fine-tuning models to optimize their performance
Modeling Functions¶
# Regression metrics (assumed to be imported here if not loaded earlier in the notebook)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Define a function to evaluate and return the model's metrics
def evaluate_model(model, x_test, y_test):
y_pred = model.predict(x_test)
metrics = {
"MAE": mean_absolute_error(y_test, y_pred),
"MSE": mean_squared_error(y_test, y_pred),
"RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
"R2 Score": r2_score(y_test, y_pred)
}
return metrics
# Cross-validation utilities (assumed to be imported here if not loaded earlier in the notebook)
from sklearn.model_selection import KFold, cross_validate

def evaluate_models_with_cv(models, X_train, y_train, X_test, y_test):
"""
Evaluates multiple regression models using cross-validation and final test set performance.
Parameters:
models: list of tuples (model_name, model_instance)
X_train, y_train: training data
X_test, y_test: test data
Returns:
- results_df: DataFrame containing CV and test metrics for each model
- trained_models: Dictionary of trained models for future use
"""
results_list = [] # List to store model results
trained_models = {} # Dictionary to store trained models
# Define 5-fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for name, model in models:
# Perform cross-validation on training set
cv_results = cross_validate(
model, X_train, y_train,
scoring=["neg_mean_absolute_error", "neg_mean_squared_error", "r2"],
cv=kfold, return_train_score=False
)
# Extract mean values and convert negatives to positives
train_mae = -cv_results["test_neg_mean_absolute_error"].mean()
train_mse = -cv_results["test_neg_mean_squared_error"].mean()
train_rmse = np.sqrt(train_mse)
train_r2 = cv_results["test_r2"].mean()
# Append CV results to list
results_list.append({
"Model": f"{name}_CV",
"MAE": train_mae,
"MSE": train_mse,
"RMSE": train_rmse,
"R2 Score": train_r2
})
# Train model on full training data and evaluate on test set
model.fit(X_train, y_train)
trained_models[name] = model # Store trained model
y_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_pred)
# Append test set results to list
results_list.append({
"Model": f"{name}_Test",
"MAE": test_mae,
"MSE": test_mse,
"RMSE": test_rmse,
"R2 Score": test_r2
})
# Convert results list to DataFrame
results_cv = pd.DataFrame(results_list)
return results_cv, trained_models
def univariate_numerical_y(y):
"""
Function to generate two plots for the numerical variable y:
- Histogram for variable distribution
- Boxplot for statistical summary
"""
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Histogram
axes[0].hist(y, bins=30, color='blue', alpha=0.7)
axes[0].set_title('Histogram of y')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')
# Boxplot
axes[1].boxplot(y, vert=False)
axes[1].set_title('Boxplot of y')
axes[1].set_xlabel('Value')
plt.tight_layout()
plt.show()
- Defined function "evaluate_model(model, x_test, y_test)" to evaluate a trained model and return its metrics as a dictionary
- Defined function "evaluate_models_with_cv(models, X_train, y_train, X_test, y_test)" to evaluate multiple regression models using cross-validation and final test-set performance
- Defined function "univariate_numerical_y(y)" to generate two plots (histogram and boxplot) for the numerical variable y
Preparing data for modeling¶
data=df6.copy()
- Modeling data (data) will be done over a copy of prepared data (df6)
# Assumed imports for this cell (if not loaded earlier in the notebook)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer
import statsmodels.api as sm

# 1. Specify independent (X) and dependent (y) variables
X = data.drop(["price"], axis=1)
y = data["price"]
# 2. Create dummy variables for categorical features
X = pd.get_dummies(X, columns=['real_state', 'neighborhood'], drop_first=True) # drop_first=True to avoid multicollinearity
# 3. Convert boolean columns to numeric (0 and 1)
bool_cols = X.select_dtypes(['bool'])
for col in bool_cols.columns:
X[col] = X[col].astype('int')
# 4. Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# 5. Transform and scale right-skewed variables (applied **only to training data to avoid data leakage**)
pt = PowerTransformer(method='yeo-johnson') # Works with zero/negative values
# Fit only on training data, then transform both training and test data
X_train[['square_meters', 'square_meters_price']] = pt.fit_transform(X_train[['square_meters', 'square_meters_price']])
X_test[['square_meters', 'square_meters_price']] = pt.transform(X_test[['square_meters', 'square_meters_price']]) # Transform only
# 6. Standardize the transformed numerical features (again, to prevent data leakage)
scaler = StandardScaler()
X_train[['square_meters', 'square_meters_price']] = scaler.fit_transform(X_train[['square_meters', 'square_meters_price']])
X_test[['square_meters', 'square_meters_price']] = scaler.transform(X_test[['square_meters', 'square_meters_price']]) # Use the same scaler
# 7. Add a constant to independent variables (after scaling, only for models that need it)
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
# Checking training and test sets.
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
Shape of Training set :  (9988, 19)
Shape of test set :  (4281, 19)
X_train.head()
| const | rooms | bathroom | lift | terrace | square_meters | square_meters_price | real_state_attic | real_state_flat | real_state_study | neighborhood_Eixample | neighborhood_Les Corts | neighborhood_Sant Martí | neighborhood_Ciutat Vella | neighborhood_Gràcia | neighborhood_Sants-Montjuïc | neighborhood_Sant Andreu | neighborhood_Horta- Guinardo | neighborhood_Nou Barris | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16047 | 1.0 | 2 | 1 | 1 | 0 | 0.163420 | 0.503709 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10334 | 1.0 | 1 | 1 | 1 | 0 | -1.500990 | 1.774314 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10144 | 1.0 | 2 | 1 | 0 | 0 | -1.953426 | 0.850089 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8401 | 1.0 | 3 | 1 | 1 | 1 | 1.666115 | -0.559885 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2041 | 1.0 | 3 | 1 | 0 | 1 | -0.098884 | -1.895277 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9988 entries, 16047 to 15324
Data columns (total 19 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   const                         9988 non-null   float64
 1   rooms                         9988 non-null   int64
 2   bathroom                      9988 non-null   int64
 3   lift                          9988 non-null   int64
 4   terrace                       9988 non-null   int64
 5   square_meters                 9988 non-null   float64
 6   square_meters_price           9988 non-null   float64
 7   real_state_attic              9988 non-null   int64
 8   real_state_flat               9988 non-null   int64
 9   real_state_study              9988 non-null   int64
 10  neighborhood_Eixample         9988 non-null   int64
 11  neighborhood_Les Corts        9988 non-null   int64
 12  neighborhood_Sant Martí       9988 non-null   int64
 13  neighborhood_Ciutat Vella     9988 non-null   int64
 14  neighborhood_Gràcia           9988 non-null   int64
 15  neighborhood_Sants-Montjuïc   9988 non-null   int64
 16  neighborhood_Sant Andreu      9988 non-null   int64
 17  neighborhood_Horta- Guinardo  9988 non-null   int64
 18  neighborhood_Nou Barris       9988 non-null   int64
dtypes: float64(3), int64(16)
memory usage: 1.5 MB
- The dataset contains numerical features on different scales, which may affect scale-sensitive algorithms.
- Several models will be tried, including distance-based models (SVR, KNN) that perform better with standardized data, and linear models (Linear, Ridge, Lasso Regression) that can converge faster with standardized inputs.
- Due to the different scales and the models to be evaluated, the data will be standardized:
- 'price' is the target variable; standardizing the target (y) is not necessary for most regression models.
- 'rooms' and 'bathroom' show discrete distributions with peaks at integer values. No scaling applied.
- Categorical or binary variables such as 'lift', 'terrace', 'real_state' and 'neighborhood' do not need scaling.
- 'square_meters' and 'square_meters_price' have right-skewed distributions and will be transformed using PowerTransformer (Yeo-Johnson) before applying StandardScaler.
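The fit-on-train / transform-on-test discipline described above can also be enforced by construction with a scikit-learn Pipeline inside a ColumnTransformer. A minimal sketch on synthetic data (the column names mirror this dataset, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.model_selection import train_test_split

# Right-skewed columns get Yeo-Johnson then standardization; the rest pass through
skewed_cols = ['square_meters', 'square_meters_price']
preprocess = ColumnTransformer(
    transformers=[
        ('skewed', Pipeline([
            ('yeo_johnson', PowerTransformer(method='yeo-johnson')),
            ('scale', StandardScaler()),
        ]), skewed_cols),
    ],
    remainder='passthrough',  # dummies and count variables stay untouched
)

# Tiny synthetic stand-in for the real housing data
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'square_meters': rng.lognormal(4, 0.5, 200),
    'square_meters_price': rng.lognormal(2.7, 0.3, 200),
    'rooms': rng.integers(1, 5, 200),
})
train, test = train_test_split(demo, test_size=0.30, random_state=1)

# Fitting happens only on the training split, so leakage cannot occur
train_t = preprocess.fit_transform(train)
test_t = preprocess.transform(test)
print(train_t.shape, test_t.shape)
```

Wrapping the transformers this way also lets the whole preprocessing step be cross-validated together with a model, which the manual fit_transform/transform calls above cannot do.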
univariate_numerical(X_train)
univariate_numerical_y(y_train)
Modeling Consolidated Notes¶
- Defined function "evaluate_model(model, x_test, y_test)" to evaluate a trained model and return its metrics as a dictionary
- Defined function "evaluate_models_with_cv(models, X_train, y_train, X_test, y_test)" to evaluate multiple regression models using cross-validation and final test-set performance
- Defined function "univariate_numerical_y(y)" to generate two plots (histogram and boxplot) for the numerical variable y
- Modeling data (data) will be done over a copy of prepared data (df6)
- The dataset contains numerical features on different scales, which may affect scale-sensitive algorithms.
- Several models will be tried, including distance-based models (SVR, KNN) that perform better with standardized data, and linear models (Linear, Ridge, Lasso Regression) that can converge faster with standardized inputs.
- Due to the different scales and the models to be evaluated, the data will be standardized:
- 'price' is the target variable; standardizing the target (y) is not necessary for most regression models.
- 'rooms' and 'bathroom' show discrete distributions with peaks at integer values. No scaling applied.
- Categorical or binary variables such as 'lift', 'terrace', 'real_state' and 'neighborhood' do not need scaling.
- 'square_meters' and 'square_meters_price' have right-skewed distributions and will be transformed using PowerTransformer (Yeo-Johnson) before applying StandardScaler.
6. Evaluation¶
Assessing the model's performance using metrics such as accuracy, precision, recall, or others relevant to the project. Ensuring the model meets the required standards for deployment.
Regression Models¶
# Model classes (assumed to be imported here if not loaded earlier in the notebook)
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Define a dictionary of regression models
regression_models = {
"Linear Regression": LinearRegression(),
"Lasso Regression": Lasso(),
"Ridge Regression": Ridge(),
"Decision Tree": DecisionTreeRegressor(),
"Random Forest": RandomForestRegressor(),
"K-Nearest Neighbors": KNeighborsRegressor(),
"Support Vector Regressor": SVR()
}
- Models to be tested are : Linear Regression, Lasso Regression, Ridge Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and Support Vector Regressor
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
- Performance Metrics:
- MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
- MSE (Mean Squared Error): Measures the average of the squares of the errors, giving more weight to larger errors.
- RMSE (Root Mean Squared Error): The square root of MSE, providing error in the same units as the target variable.
- R2 Score (Coefficient of Determination): Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
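As a sanity check, all four metrics can be computed by hand with NumPy and compared against scikit-learn on a toy prediction vector (the price values below are illustrative, not from this dataset):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1000.0, 850.0, 1250.0, 920.0])
y_pred = np.array([ 980.0, 900.0, 1200.0, 950.0])

err = y_true - y_pred
mae  = np.mean(np.abs(err))         # average error magnitude, ignoring direction
mse  = np.mean(err ** 2)            # squaring gives extra weight to large errors
rmse = np.sqrt(mse)                 # back in the target's units (price)
r2   = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Hand-rolled values should match the library implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2,  r2_score(y_true, y_pred))
print(mae, rmse, r2)  # 37.5, ~39.69, ~0.931
```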
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
model.fit(X_train, y_train)
metrics = evaluate_model(model, X_test, y_test)
metrics["Model"] = model_name # Add model name for reference
results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 15.5 s
Wall time: 16.3 s
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
| Model | MAE | MSE | RMSE | R2 Score | |
|---|---|---|---|---|---|
| 4 | Random Forest | 39.785481 | 4165.852534 | 64.543416 | 0.970178 |
| 3 | Decision Tree | 50.995982 | 7282.203392 | 85.335827 | 0.947870 |
| 2 | Ridge Regression | 67.054647 | 9179.684362 | 95.810669 | 0.934286 |
| 0 | Linear Regression | 67.057076 | 9180.541068 | 95.815140 | 0.934280 |
| 1 | Lasso Regression | 67.493485 | 9249.945144 | 96.176635 | 0.933783 |
| 5 | K-Nearest Neighbors | 74.951133 | 11475.952189 | 107.125871 | 0.917848 |
| 6 | Support Vector Regressor | 98.720296 | 23091.488357 | 151.958838 | 0.834697 |
- Sorting results_df by MAE, MSE, or RMSE yields exactly the same model ranking as the R² Score table above, so those three tables are omitted here.
- Random Forest metrics: Lowest MAE, lowest RMSE, and highest R².
- Random Forest is the best performer overall, indicating strong predictive accuracy and low error.
- Decision Tree metrics: Moderate errors with a good R².
- Decision Tree is a strong candidate, although slightly behind Random Forest.
- Ridge, Linear, and Lasso Regression metrics are consistent with each other, but their performance is noticeably lower than the tree-based methods. They might not be ideal for further tuning if the goal is the best predictive performance.
- For hyperparameter tuning and further validation, Random Forest and Decision Tree stand out as the best candidates due to their superior performance metrics.
- While the linear models (Ridge, Linear, and Lasso) can serve as strong baselines, they do not match the predictive accuracy of the tree-based models.
- K-Nearest Neighbors and SVR appear less promising for further development on this dataset.
Feature Engineering¶
# Define a Random Forest model with default hyperparameters
RandomForest = RandomForestRegressor()
# Train the model on the entire training dataset
RandomForest.fit(X_train, y_train)
# Feature importance
feature_importances = pd.Series(RandomForest.feature_importances_, index=X_train.columns)
feature_importances = feature_importances.sort_values(ascending=False)
# Plotting
plt.figure(figsize=(10, 6))
feature_importances.plot(kind='bar')
plt.title('Feature Importance')
plt.xlabel('Features')
plt.ylabel('Importance Score')
plt.show()
- From the feature importance plot, square_meters is the most significant variable, followed by square_meters_price.
- Since price is directly derived from square_meters * square_meters_price, including both adds no new information and introduces redundancy.
- It also makes no sense to ask the end user for square_meters_price in order to "predict" price, since that value already implies the answer.
- NEW MODELS will therefore be evaluated with the feature square_meters_price DROPPED from the data.
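The leakage argument is easy to demonstrate on toy data (illustrative numbers, not the Barcelona dataset):

```python
import pandas as pd

# Toy rows using the dataset's column names (hypothetical values):
toy = pd.DataFrame({
    "square_meters": [80.0, 120.0, 60.0],
    "square_meters_price": [4000.0, 3500.0, 4500.0],
})
# If price is defined as area times price per square meter...
toy["price"] = toy["square_meters"] * toy["square_meters_price"]

# ...then square_meters_price leaks the target: together with
# square_meters it reconstructs price exactly, with no model needed.
leak = toy["square_meters"] * toy["square_meters_price"]
assert (leak == toy["price"]).all()
```

Any model given both features can reconstruct the target almost exactly, which would explain the near-perfect R² scores in the first comparison table.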
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])
vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

rooms                           11.080074
bathroom                        11.296229
lift                             3.449079
terrace                          1.346301
square_meters                    2.099052
square_meters_price              1.568111
real_state_attic                 1.352480
real_state_flat                  9.137370
real_state_study                 1.165991
neighborhood_Eixample            2.586087
neighborhood_Les Corts           1.328271
neighborhood_Sant Martí          1.425954
neighborhood_Ciutat Vella        1.887352
neighborhood_Gràcia              1.483803
neighborhood_Sants-Montjuïc      1.441272
neighborhood_Sant Andreu         1.144297
neighborhood_Horta- Guinardo     1.242755
neighborhood_Nou Barris          1.077936
dtype: float64
- Although its VIF (1.568) is low (suggesting no strong collinearity within the dataset), the mathematical dependence between square_meters and square_meters_price indicates redundancy.
- This means the model could overestimate the importance of one feature over the other and produce unstable coefficient estimates.
- By keeping only square_meters, the model remains more interpretable, focusing on how space affects price rather than on a derived variable.
- The features 'rooms' and 'bathroom' show high multicollinearity (VIF > 10) and will also be dropped from modeling.
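A VIF value is just 1 / (1 − R²) from regressing one feature on all the others. A self-contained numpy sketch on synthetic data (hypothetical columns mimicking the rooms/bathroom collinearity seen above):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes
    from regressing that column (with intercept) on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
rooms = rng.normal(3, 1, 500)
bathroom = 0.9 * rooms + rng.normal(0, 0.3, 500)  # strongly tied to rooms
terrace = rng.normal(0, 1, 500)                   # independent feature
vifs = vif(np.column_stack([rooms, bathroom, terrace]))
# rooms and bathroom inflate each other's VIF well above 5;
# the independent terrace column stays near 1.
```

This is the same quantity statsmodels' `variance_inflation_factor` computes in the cells above.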
def preprocess_data(data, target_feature, drop_features, scale_features, test_size=0.30, random_state=1):
    """
    Preprocesses the dataset by handling categorical variables, boolean conversion,
    splitting data, transforming skewed features, standardizing, and adding a constant.

    Parameters:
    - data: DataFrame containing the full dataset.
    - target_feature: Name of the dependent variable.
    - drop_features: List of features to drop from the dataset.
    - scale_features: List of numerical features to transform and scale.
    - test_size: Proportion of the dataset to include in the test split.
    - random_state: Seed for reproducibility.

    Returns:
    - X_train, X_test, y_train, y_test: Processed training and test datasets.
    """
    # 1. Specify independent (X) and dependent (y) variables
    X = data.drop(drop_features, axis=1)
    y = data[target_feature]

    # 2. Create dummy variables for categorical features
    categorical_features = ['real_state', 'neighborhood']
    X = pd.get_dummies(X, columns=categorical_features, drop_first=True)

    # 3. Convert boolean columns to numeric (0 and 1)
    bool_cols = X.select_dtypes(['bool']).columns
    X[bool_cols] = X[bool_cols].astype(int)

    # 4. Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # 5. Transform and scale right-skewed variables (PowerTransformer for skewed data)
    pt = PowerTransformer(method='yeo-johnson')
    X_train[scale_features] = pt.fit_transform(X_train[scale_features])
    X_test[scale_features] = pt.transform(X_test[scale_features])

    # 6. Standardize the transformed numerical features
    scaler = StandardScaler()
    X_train[scale_features] = scaler.fit_transform(X_train[scale_features])
    X_test[scale_features] = scaler.transform(X_test[scale_features])

    # 7. Add a constant to independent variables (after scaling)
    X_train = sm.add_constant(X_train)
    X_test = sm.add_constant(X_test)

    return X_train, X_test, y_train, y_test
- Defined the function preprocess_data(data, target_feature, drop_features, scale_features, test_size=0.30, random_state=1) to iterate quickly on data preparation for modeling
X_train, X_test, y_train, y_test = preprocess_data(data, 'price', ['price', 'square_meters_price'], ['square_meters'], test_size=0.30, random_state=1)
- Data preparation dropping the feature square_meters_price
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 13.6 s Wall time: 13.8 s
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 0 | Linear Regression | 192.829593 | 62956.269056 | 250.910879 | 0.549322 |
| 2 | Ridge Regression | 192.833577 | 62957.371541 | 250.913076 | 0.549314 |
| 1 | Lasso Regression | 193.843038 | 63504.450907 | 252.000895 | 0.545397 |
| 4 | Random Forest | 188.532068 | 64705.626597 | 254.373007 | 0.536799 |
| 5 | K-Nearest Neighbors | 193.531091 | 67397.543929 | 259.610369 | 0.517528 |
| 6 | Support Vector Regressor | 210.787271 | 86373.521736 | 293.893725 | 0.381687 |
| 3 | Decision Tree | 219.420021 | 95376.122393 | 308.830249 | 0.317241 |
- Linear Regression and Ridge Regression performed the best in terms of R² Score
- Feature selection will be performed to reduce multicollinearity.
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])
vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

rooms                           10.837814
bathroom                        10.048749
lift                             3.432845
terrace                          1.338510
square_meters                    1.609008
real_state_attic                 1.348520
real_state_flat                  8.486485
real_state_study                 1.147954
neighborhood_Eixample            2.580478
neighborhood_Les Corts           1.328092
neighborhood_Sant Martí          1.425753
neighborhood_Ciutat Vella        1.887234
neighborhood_Gràcia              1.483735
neighborhood_Sants-Montjuïc      1.439393
neighborhood_Sant Andreu         1.141826
neighborhood_Horta- Guinardo     1.235647
neighborhood_Nou Barris          1.075084
dtype: float64
X_train, X_test, y_train, y_test = preprocess_data(data, 'price', ['price', 'square_meters_price', 'rooms'], ['square_meters'], test_size=0.30, random_state=1)
- Data preparation dropping the feature 'rooms' due to high multicollinearity
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 13.2 s Wall time: 13.5 s
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 0 | Linear Regression | 197.545690 | 65051.360633 | 255.051682 | 0.534324 |
| 2 | Ridge Regression | 197.554341 | 65051.439986 | 255.051838 | 0.534323 |
| 1 | Lasso Regression | 198.916759 | 65589.695703 | 256.104853 | 0.530470 |
| 5 | K-Nearest Neighbors | 198.173137 | 69939.496099 | 264.460765 | 0.499331 |
| 4 | Random Forest | 198.968127 | 71029.109970 | 266.512870 | 0.491531 |
| 6 | Support Vector Regressor | 211.486036 | 86079.067618 | 293.392344 | 0.383795 |
| 3 | Decision Tree | 227.006181 | 98680.824283 | 314.135041 | 0.293584 |
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])
vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

bathroom                         8.222008
lift                             3.418865
terrace                          1.338489
square_meters                    1.354581
real_state_attic                 1.323041
real_state_flat                  7.304896
real_state_study                 1.144434
neighborhood_Eixample            2.459688
neighborhood_Les Corts           1.300169
neighborhood_Sant Martí          1.382281
neighborhood_Ciutat Vella        1.849081
neighborhood_Gràcia              1.440009
neighborhood_Sants-Montjuïc      1.393866
neighborhood_Sant Andreu         1.121367
neighborhood_Horta- Guinardo     1.201621
neighborhood_Nou Barris          1.063670
dtype: float64
- After removing the feature 'rooms', Linear Regression and Ridge Regression still perform best in terms of R² Score, but features with high multicollinearity remain
X_train, X_test, y_train, y_test = preprocess_data(data, 'price', ['price', 'square_meters_price', 'rooms', 'bathroom'], ['square_meters'], test_size=0.30, random_state=1)
- Data preparation dropping the feature 'bathroom' due to high multicollinearity
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 13.5 s Wall time: 14.5 s
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 2 | Ridge Regression | 203.954404 | 68258.459986 | 261.263201 | 0.511365 |
| 0 | Linear Regression | 203.946460 | 68259.489536 | 261.265171 | 0.511358 |
| 1 | Lasso Regression | 204.980574 | 68760.982886 | 262.223155 | 0.507768 |
| 5 | K-Nearest Neighbors | 205.304975 | 74808.167353 | 273.510818 | 0.464479 |
| 4 | Random Forest | 203.990067 | 74902.742779 | 273.683655 | 0.463802 |
| 6 | Support Vector Regressor | 216.002731 | 89473.683618 | 299.121520 | 0.359494 |
| 3 | Decision Tree | 235.145948 | 106969.349675 | 327.061691 | 0.234250 |
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])
vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

lift                             3.240716
terrace                          1.335683
square_meters                    1.100341
real_state_attic                 1.259749
real_state_flat                  5.435033
real_state_study                 1.099832
neighborhood_Eixample            2.160766
neighborhood_Les Corts           1.248794
neighborhood_Sant Martí          1.317437
neighborhood_Ciutat Vella        1.622886
neighborhood_Gràcia              1.365185
neighborhood_Sants-Montjuïc      1.317733
neighborhood_Sant Andreu         1.107205
neighborhood_Horta- Guinardo     1.170314
neighborhood_Nou Barris          1.056992
dtype: float64
- The feature real_state_flat remains with VIF > 5.
- Since "flat" is the most frequent category across neighborhoods, it is likely correlated with some of the neighborhood dummies.
- Instead of removing real_state_flat, it will be used as the baseline category for real_state.
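Choosing the baseline explicitly just means one-hot encoding without drop_first and then dropping the chosen column; a small sketch on toy data (hypothetical rows):

```python
import pandas as pd

toy = pd.DataFrame({"real_state": ["flat", "attic", "study", "flat"]})

# Encode all categories, then drop the chosen baseline ("flat"):
dummies = pd.get_dummies(toy, columns=["real_state"], drop_first=False)
dummies = dummies.drop(columns=["real_state_flat"])
# The remaining dummy columns are interpreted relative to the "flat" baseline
```

With drop_first=True, pandas would instead drop the alphabetically first category, which is why explicit control is needed here.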
def preprocess_data(data, target_feature, drop_features, scale_features, categorical_features, baseline_categories, test_size=0.30, random_state=1):
    """
    Preprocesses the dataset by handling categorical variables, boolean conversion,
    splitting data, transforming skewed features, standardizing, and adding a constant.

    Parameters:
    - data: DataFrame containing the full dataset.
    - target_feature: Name of the dependent variable.
    - drop_features: List of features to drop.
    - scale_features: List of numerical features to transform and scale.
    - categorical_features: List of categorical features to encode.
    - baseline_categories: Dictionary specifying the baseline category for each categorical variable.
    - test_size: Proportion of the dataset to include in the test split.
    - random_state: Seed for reproducibility.

    Returns:
    - X_train, X_test, y_train, y_test: Processed training and test datasets.
    """
    # 1. Specify independent (X) and dependent (y) variables
    X = data.drop([target_feature] + drop_features, axis=1)
    y = data[target_feature]

    # 2. Create dummy variables for categorical features with specified baseline categories
    X = pd.get_dummies(X, columns=categorical_features, drop_first=False)
    for feature, baseline in baseline_categories.items():
        if f"{feature}_{baseline}" in X.columns:
            X.drop(columns=[f"{feature}_{baseline}"], inplace=True)

    # 3. Convert boolean columns to numeric (0 and 1)
    bool_cols = X.select_dtypes(['bool']).columns
    X[bool_cols] = X[bool_cols].astype(int)

    # 4. Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # 5. Transform and scale right-skewed variables (PowerTransformer for skewed data)
    pt = PowerTransformer(method='yeo-johnson')
    X_train[scale_features] = pt.fit_transform(X_train[scale_features])
    X_test[scale_features] = pt.transform(X_test[scale_features])

    # 6. Standardize the transformed numerical features
    scaler = StandardScaler()
    X_train[scale_features] = scaler.fit_transform(X_train[scale_features])
    X_test[scale_features] = scaler.transform(X_test[scale_features])

    # 7. Add a constant to independent variables (after scaling)
    X_train = sm.add_constant(X_train)
    X_test = sm.add_constant(X_test)

    return X_train, X_test, y_train, y_test
- Modified the preprocess_data function to control which category is dropped during one-hot encoding (the baseline category)
plot_crosstab_heat_perc(df6, var_interest='real_state',df_name="prepared data")
- Selected real_state_flat and neighborhood_Eixample as the baseline categories for one-hot encoding
X_train, X_test, y_train, y_test = preprocess_data(
    data=data,
    target_feature="price",
    drop_features=["price", "square_meters_price", "rooms", "bathroom"],
    scale_features=["square_meters"],
    categorical_features=["real_state", "neighborhood"],
    baseline_categories={"real_state": "flat", "neighborhood": "Eixample"},
    test_size=0.30,
    random_state=1
)
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 15.8 s Wall time: 18.2 s
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 0 | Linear Regression | 203.946460 | 68259.489536 | 261.265171 | 0.511358 |
| 2 | Ridge Regression | 203.959557 | 68260.799293 | 261.267677 | 0.511349 |
| 1 | Lasso Regression | 205.169206 | 68709.866263 | 262.125669 | 0.508134 |
| 5 | K-Nearest Neighbors | 204.794534 | 74634.435609 | 273.193037 | 0.465722 |
| 4 | Random Forest | 203.508139 | 74730.319708 | 273.368469 | 0.465036 |
| 6 | Support Vector Regressor | 216.052970 | 89802.139881 | 299.670052 | 0.357143 |
| 3 | Decision Tree | 234.658293 | 106104.225463 | 325.736436 | 0.240443 |
# Drop the constant column
X_train_vif = X_train.drop(columns=['const'])
vif_series = pd.Series(
    [variance_inflation_factor(X_train_vif.values, i) for i in range(X_train_vif.shape[1])],
    index=X_train_vif.columns,
    dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

lift                                2.020784
terrace                             1.325467
square_meters                       1.103906
real_state_apartment                1.082000
real_state_attic                    1.093367
real_state_study                    1.049390
neighborhood_Sarria-Sant Gervasi    1.255990
neighborhood_Les Corts              1.098590
neighborhood_Sant Martí             1.121619
neighborhood_Ciutat Vella           1.207858
neighborhood_Gràcia                 1.118434
neighborhood_Sants-Montjuïc         1.105899
neighborhood_Sant Andreu            1.037034
neighborhood_Horta- Guinardo        1.053019
neighborhood_Nou Barris             1.018164
dtype: float64
- There is no remaining multicollinearity in the data, suggesting that the number of rooms and bathrooms is less relevant than the property's area, type, and neighborhood.
- Linear Regression and Ridge Regression are the best models among those tested, but the R² scores suggest the models are not explaining a large portion of the variance in the target variable.
- More advanced models will be included in the evaluation.
Advanced Regression Models¶
- Models to be tested are: DecisionTree_Tuned_1, RandomForest_Tuned_1, GradientBoosting_Tuned_1, XGBoost_Tuned_1, LightGBM_Tuned_1, NeuralNetwork(MLP)
# Define a dictionary of regression models
regression_models_2 = {
    "DecisionTree_Tuned_1": DecisionTreeRegressor(max_depth=10, min_samples_split=5),
    "RandomForest_Tuned_1": RandomForestRegressor(max_depth=10, min_samples_split=5, n_estimators=200),
    "GradientBoosting_Tuned_1": GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=5),
    "XGBoost_Tuned_1": xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=5),
    "LightGBM_Tuned_1": lgb.LGBMRegressor(n_estimators=200, learning_rate=0.1, max_depth=5, verbose=-1),
    # "CatBoost": catb.CatBoostRegressor(iterations=200, learning_rate=0.1, depth=5, verbose=0),
    "NeuralNetwork(MLP)": MLPRegressor(hidden_layer_sizes=(100,), activation='relu', solver='adam', max_iter=500)
}
# Initialize an empty DataFrame to store results
results_df_2 = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models_2.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df_2 = pd.concat([results_df_2, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 19.7 s Wall time: 19.5 s
# Display the results DataFrame
results_df_2.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 2 | GradientBoosting_Tuned_1 | 190.979607 | 63554.668646 | 252.100513 | 0.545038 |
| 4 | LightGBM_Tuned_1 | 191.308692 | 63653.691737 | 252.296833 | 0.544329 |
| 3 | XGBoost_Tuned_1 | 191.172298 | 63903.217757 | 252.790858 | 0.542543 |
| 1 | RandomForest_Tuned_1 | 192.255939 | 64194.833169 | 253.366993 | 0.540455 |
| 5 | NeuralNetwork(MLP) | 201.536209 | 66683.063761 | 258.230641 | 0.522643 |
| 0 | DecisionTree_Tuned_1 | 197.879923 | 70119.813941 | 264.801461 | 0.498041 |
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 0 | Linear Regression | 203.946460 | 68259.489536 | 261.265171 | 0.511358 |
| 2 | Ridge Regression | 203.959557 | 68260.799293 | 261.267677 | 0.511349 |
| 1 | Lasso Regression | 205.169206 | 68709.866263 | 262.125669 | 0.508134 |
| 5 | K-Nearest Neighbors | 204.794534 | 74634.435609 | 273.193037 | 0.465722 |
| 4 | Random Forest | 203.508139 | 74730.319708 | 273.368469 | 0.465036 |
| 6 | Support Vector Regressor | 216.052970 | 89802.139881 | 299.670052 | 0.357143 |
| 3 | Decision Tree | 234.658293 | 106104.225463 | 325.736436 | 0.240443 |
- The best R² score from the advanced models is currently 0.5450, achieved by the Gradient Boosting model.
- This improves on the 0.5113 from Linear Regression and could potentially be improved further with model tuning.
Model Tuning¶
def tune_gradient_boosting():
    print("Tuning Gradient Boosting...")
    param_grid = {
        'n_estimators': [100, 200, 300, 500],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'subsample': [0.8, 0.9, 1.0],
        'max_features': ['sqrt', 'log2', None]
    }
    gb = GradientBoostingRegressor(random_state=42)
    grid_search = RandomizedSearchCV(
        estimator=gb,
        param_distributions=param_grid,
        n_iter=20,
        cv=5,
        scoring='r2',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    grid_search.fit(X_train, y_train)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best R2 score: {grid_search.best_score_:.4f}")
    best_gb = grid_search.best_estimator_
    return best_gb
tune_gradient_boosting()
Tuning Gradient Boosting...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters: {'subsample': 0.8, 'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': None, 'max_depth': 3, 'learning_rate': 0.1}
Best R2 score: 0.5434
GradientBoostingRegressor(min_samples_leaf=4, min_samples_split=10,
                          random_state=42, subsample=0.8)
def tune_xgboost():
    print("Tuning XGBoost...")
    param_grid = {
        'n_estimators': [100, 200, 300, 500],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9],
        'min_child_weight': [1, 3, 5, 7],
        'gamma': [0, 0.1, 0.2, 0.3],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'reg_alpha': [0, 0.1, 1, 10],
        'reg_lambda': [0, 1, 5, 10]
    }
    xgb_model = xgb.XGBRegressor(random_state=42)
    grid_search = RandomizedSearchCV(
        estimator=xgb_model,
        param_distributions=param_grid,
        n_iter=20,
        cv=5,
        scoring='r2',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    grid_search.fit(X_train, y_train)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best R2 score: {grid_search.best_score_:.4f}")
    best_xgb = grid_search.best_estimator_
    return best_xgb
tune_xgboost()
Tuning XGBoost...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters: {'subsample': 0.6, 'reg_lambda': 0, 'reg_alpha': 10, 'n_estimators': 500, 'min_child_weight': 3, 'max_depth': 3, 'learning_rate': 0.05, 'gamma': 0.2, 'colsample_bytree': 1.0}
Best R2 score: 0.5456
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=0.2, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.05, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=3, max_leaves=None,
             min_child_weight=3, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=500, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)
def tune_lightgbm():
    print("Tuning LightGBM...")
    param_grid = {
        'n_estimators': [100, 200, 300, 500],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9, -1],
        'num_leaves': [31, 50, 100, 150],
        'min_child_samples': [5, 10, 20, 50],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'reg_alpha': [0, 0.1, 1, 10],
        'reg_lambda': [0, 1, 5, 10]
    }
    lgb_model = lgb.LGBMRegressor(random_state=42, verbose=-1)
    grid_search = RandomizedSearchCV(
        estimator=lgb_model,
        param_distributions=param_grid,
        n_iter=20,
        cv=5,
        scoring='r2',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    grid_search.fit(X_train, y_train)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best R2 score: {grid_search.best_score_:.4f}")
    best_lgb = grid_search.best_estimator_
    return best_lgb
tune_lightgbm()
Tuning LightGBM...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters: {'subsample': 0.6, 'reg_lambda': 1, 'reg_alpha': 10, 'num_leaves': 50, 'n_estimators': 500, 'min_child_samples': 5, 'max_depth': 3, 'learning_rate': 0.05, 'colsample_bytree': 0.6}
Best R2 score: 0.5443
LGBMRegressor(colsample_bytree=0.6, learning_rate=0.05, max_depth=3,
              min_child_samples=5, n_estimators=500, num_leaves=50,
              random_state=42, reg_alpha=10, reg_lambda=1, subsample=0.6,
              verbose=-1)
import optuna

# ===============================================
# Advanced Hyperparameter Tuning with Optuna
# ===============================================
def tune_with_optuna(model_type):
    print(f"Tuning {model_type} with Optuna...")

    def objective(trial):
        if model_type == 'xgboost':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'max_depth': trial.suggest_int('max_depth', 3, 10),
                'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
                'gamma': trial.suggest_float('gamma', 0, 1),
                'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
                'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
                'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
                'random_state': 42
            }
            model = xgb.XGBRegressor(**params)
        elif model_type == 'lightgbm':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'max_depth': trial.suggest_int('max_depth', 3, 10),
                'num_leaves': trial.suggest_int('num_leaves', 20, 200),
                'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
                'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
                'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
                'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
                'random_state': 42,
                'verbose': -1
            }
            model = lgb.LGBMRegressor(**params)
        elif model_type == 'gbr':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'max_depth': trial.suggest_int('max_depth', 3, 10),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
                'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
                'random_state': 42
            }
            model = GradientBoostingRegressor(**params)
        else:
            raise ValueError(f"Unknown model type: {model_type}")

        # Use cross-validation for more robust evaluation
        kf = KFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='r2')
        return scores.mean()

    # Create and optimize the study
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50)

    print(f"Best trial: {study.best_trial.number}")
    print(f"Best R2 score: {study.best_value:.4f}")
    print(f"Best parameters: {study.best_params}")

    # Create a model with the best parameters
    if model_type == 'xgboost':
        best_model = xgb.XGBRegressor(**study.best_params)
    elif model_type == 'lightgbm':
        best_model = lgb.LGBMRegressor(**study.best_params)
    elif model_type == 'gbr':
        best_model = GradientBoostingRegressor(**study.best_params)

    # Train and evaluate on the test set
    best_model.fit(X_train, y_train)
    return best_model
tune_with_optuna('xgboost')
[I 2025-02-28 18:26:08,821] A new study created in memory with name: no-name-c3bea523-3901-49e5-bc06-10a192dccd4c
Tuning xgboost with Optuna...
[I 2025-02-28 18:26:24,163] Trial 0 finished with value: 0.48917752504348755 and parameters: {'n_estimators': 888, 'learning_rate': 0.13638048024350496, 'max_depth': 8, 'min_child_weight': 8, 'gamma': 0.06434676023791375, 'subsample': 0.7165790736450253, 'colsample_bytree': 0.5121990183798227, 'reg_alpha': 5.746612857926817, 'reg_lambda': 6.923761570557943}. Best is trial 0 with value: 0.48917752504348755.
[I 2025-02-28 18:26:30,704] Trial 1 finished with value: 0.48238654136657716 and parameters: {'n_estimators': 393, 'learning_rate': 0.09384983663772839, 'max_depth': 10, 'min_child_weight': 4, 'gamma': 0.5514728804308722, 'subsample': 0.5981765793087479, 'colsample_bytree': 0.8075428855306195, 'reg_alpha': 9.307261576083217, 'reg_lambda': 7.241394804967402}. Best is trial 0 with value: 0.48917752504348755.
[I 2025-02-28 18:26:34,803] Trial 2 finished with value: 0.5223005294799805 and parameters: {'n_estimators': 407, 'learning_rate': 0.11871603858664527, 'max_depth': 6, 'min_child_weight': 2, 'gamma': 0.6967152026076723, 'subsample': 0.8846920736377173, 'colsample_bytree': 0.5133437318094024, 'reg_alpha': 0.8649785036720237, 'reg_lambda': 5.878119849951487}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:26:40,994] Trial 3 finished with value: 0.50477694272995 and parameters: {'n_estimators': 665, 'learning_rate': 0.15214036925300578, 'max_depth': 7, 'min_child_weight': 8, 'gamma': 0.9031770410225999, 'subsample': 0.681241311018806, 'colsample_bytree': 0.5141291135344452, 'reg_alpha': 4.33395472050038, 'reg_lambda': 9.411524376239278}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:26:45,903] Trial 4 finished with value: 0.5029198408126831 and parameters: {'n_estimators': 700, 'learning_rate': 0.2995516278226636, 'max_depth': 4, 'min_child_weight': 3, 'gamma': 0.1306726528413399, 'subsample': 0.9542975672262177, 'colsample_bytree': 0.747427421816002, 'reg_alpha': 2.987130049785515, 'reg_lambda': 3.2037729939708024}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:26:52,834] Trial 5 finished with value: 0.42719292640686035 and parameters: {'n_estimators': 674, 'learning_rate': 0.29003915104330136, 'max_depth': 6, 'min_child_weight': 3, 'gamma': 0.9274640608616926, 'subsample': 0.988081540402282, 'colsample_bytree': 0.8941856867048612, 'reg_alpha': 0.3263042730315746, 'reg_lambda': 1.4852147107431324}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:27:01,629] Trial 6 finished with value: 0.4879873275756836 and parameters: {'n_estimators': 952, 'learning_rate': 0.15389715502445497, 'max_depth': 6, 'min_child_weight': 5, 'gamma': 0.9090277514328283, 'subsample': 0.897013687505599, 'colsample_bytree': 0.5050074177242521, 'reg_alpha': 4.353580911570742, 'reg_lambda': 1.4369264223904599}. Best is trial 2 with value: 0.5223005294799805.
[I 2025-02-28 18:27:04,869] Trial 7 finished with value: 0.5319819688796997 and parameters: {'n_estimators': 421, 'learning_rate': 0.2022139181760185, 'max_depth': 3, 'min_child_weight': 2, 'gamma': 0.9147586121490527, 'subsample': 0.6467542388459349, 'colsample_bytree': 0.9331485843647656, 'reg_alpha': 5.870927504162713, 'reg_lambda': 7.474268183728644}. Best is trial 7 with value: 0.5319819688796997.
[I 2025-02-28 18:27:06,070] Trial 8 finished with value: 0.543352198600769 and parameters: {'n_estimators': 168, 'learning_rate': 0.07709481416442678, 'max_depth': 3, 'min_child_weight': 6, 'gamma': 0.6179366698460165, 'subsample': 0.6278972757798043, 'colsample_bytree': 0.8190965895343418, 'reg_alpha': 5.518448204079417, 'reg_lambda': 2.9877500989676453}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:17,093] Trial 9 finished with value: 0.44873361587524413 and parameters: {'n_estimators': 866, 'learning_rate': 0.1516948752160496, 'max_depth': 9, 'min_child_weight': 9, 'gamma': 0.6178197494002838, 'subsample': 0.9407339655954512, 'colsample_bytree': 0.6605986109530679, 'reg_alpha': 6.500892125923215, 'reg_lambda': 0.03437041373199001}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:18,599] Trial 10 finished with value: 0.5331526756286621 and parameters: {'n_estimators': 137, 'learning_rate': 0.02033192710934223, 'max_depth': 4, 'min_child_weight': 6, 'gamma': 0.292620892412038, 'subsample': 0.5216325280362273, 'colsample_bytree': 0.9840773811906716, 'reg_alpha': 8.155834349462769, 'reg_lambda': 3.9360639171120804}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:19,931] Trial 11 finished with value: 0.45643426179885865 and parameters: {'n_estimators': 106, 'learning_rate': 0.010488983845342616, 'max_depth': 4, 'min_child_weight': 6, 'gamma': 0.3049572471854668, 'subsample': 0.520442469968559, 'colsample_bytree': 0.9999609534850513, 'reg_alpha': 8.468519779713077, 'reg_lambda': 3.9708718272349968}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:21,183] Trial 12 finished with value: 0.5151201605796814 and parameters: {'n_estimators': 100, 'learning_rate': 0.025399498046396247, 'max_depth': 3, 'min_child_weight': 6, 'gamma': 0.36416676033415973, 'subsample': 0.5411003789643622, 'colsample_bytree': 0.8339552529517178, 'reg_alpha': 7.700643593211, 'reg_lambda': 4.036221719845212}. Best is trial 8 with value: 0.543352198600769.
[I 2025-02-28 18:27:23,985] Trial 13 finished with value: 0.5437254309654236 and parameters: {'n_estimators': 250, 'learning_rate': 0.06693226182648743, 'max_depth': 4, 'min_child_weight': 7, 'gamma': 0.3696765199367511, 'subsample': 0.794711808518893, 'colsample_bytree': 0.6865005344815053, 'reg_alpha': 9.910407108713995, 'reg_lambda': 2.2326642799541316}. Best is trial 13 with value: 0.5437254309654236.
[I 2025-02-28 18:27:26,539] Trial 14 finished with value: 0.5405412554740906 and parameters: {'n_estimators': 263, 'learning_rate': 0.06381385702672845, 'max_depth': 5, 'min_child_weight': 8, 'gamma': 0.4287584297754668, 'subsample': 0.8042622571793558, 'colsample_bytree': 0.6715707976463381, 'reg_alpha': 2.5083568190792556, 'reg_lambda': 2.2601519358002795}. Best is trial 13 with value: 0.5437254309654236.
[I 2025-02-28 18:27:28,581] Trial 15 finished with value: 0.5440799832344055 and parameters: {'n_estimators': 244, 'learning_rate': 0.07041223547960868, 'max_depth': 3, 'min_child_weight': 10, 'gamma': 0.7197881334465857, 'subsample': 0.7779178290732698, 'colsample_bytree': 0.679125411798041, 'reg_alpha': 9.569034426519458, 'reg_lambda': 0.12279704257417778}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:31,539] Trial 16 finished with value: 0.540407121181488 and parameters: {'n_estimators': 275, 'learning_rate': 0.06063294345361731, 'max_depth': 5, 'min_child_weight': 10, 'gamma': 0.7365524645454472, 'subsample': 0.8110906118418999, 'colsample_bytree': 0.623141542992063, 'reg_alpha': 9.781172279316053, 'reg_lambda': 0.43143826902046456}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:37,498] Trial 17 finished with value: 0.5101608753204345 and parameters: {'n_estimators': 537, 'learning_rate': 0.19241134263140267, 'max_depth': 5, 'min_child_weight': 10, 'gamma': 0.7467361833627015, 'subsample': 0.7717331640243381, 'colsample_bytree': 0.7264319782750525, 'reg_alpha': 7.115478280534627, 'reg_lambda': 1.1523235971868167}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:40,731] Trial 18 finished with value: 0.5394483804702759 and parameters: {'n_estimators': 285, 'learning_rate': 0.10229676823863273, 'max_depth': 4, 'min_child_weight': 7, 'gamma': 0.18883282239766835, 'subsample': 0.8371227008001572, 'colsample_bytree': 0.5736934015173419, 'reg_alpha': 9.06265115610064, 'reg_lambda': 2.3752129472463364}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:44,909] Trial 19 finished with value: 0.5307446241378784 and parameters: {'n_estimators': 500, 'learning_rate': 0.20147488495809995, 'max_depth': 3, 'min_child_weight': 9, 'gamma': 0.474104213089799, 'subsample': 0.7372191544654088, 'colsample_bytree': 0.6913039115734858, 'reg_alpha': 9.67396253041139, 'reg_lambda': 5.127762143253605}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:51,513] Trial 20 finished with value: 0.5363608479499817 and parameters: {'n_estimators': 208, 'learning_rate': 0.04951744310034004, 'max_depth': 7, 'min_child_weight': 9, 'gamma': 0.8011995402540919, 'subsample': 0.8614635610779272, 'colsample_bytree': 0.6017936502418365, 'reg_alpha': 7.17655323116587, 'reg_lambda': 0.9345585130656484}. Best is trial 15 with value: 0.5440799832344055.
[I 2025-02-28 18:27:54,395] Trial 21 finished with value: 0.5442478656768799 and parameters: {'n_estimators': 206, 'learning_rate': 0.08027934373962738, 'max_depth': 3, 'min_child_weight': 5, 'gamma': 0.5832246430248952, 'subsample': 0.6088060632152655, 'colsample_bytree': 0.7688938558463965, 'reg_alpha': 8.550184417955817, 'reg_lambda': 2.4928865849773194}. Best is trial 21 with value: 0.5442478656768799.
[I 2025-02-28 18:27:57,272] Trial 22 finished with value: 0.5443528175354004 and parameters: {'n_estimators': 345, 'learning_rate': 0.045342520753788654, 'max_depth': 3, 'min_child_weight': 4, 'gamma': 0.5383740706117337, 'subsample': 0.7662572032520517, 'colsample_bytree': 0.7860995489943112, 'reg_alpha': 8.46310279421163, 'reg_lambda': 2.1457065169311202}. Best is trial 22 with value: 0.5443528175354004.
[I 2025-02-28 18:28:02,431] Trial 23 finished with value: 0.5443742871284485 and parameters: {'n_estimators': 330, 'learning_rate': 0.03571713088640459, 'max_depth': 3, 'min_child_weight': 4, 'gamma': 0.5732685763841034, 'subsample': 0.6996412869755224, 'colsample_bytree': 0.7790660629800674, 'reg_alpha': 8.540430721935605, 'reg_lambda': 0.44990861881143535}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:05,876] Trial 24 finished with value: 0.5413497686386108 and parameters: {'n_estimators': 333, 'learning_rate': 0.03769392614153432, 'max_depth': 5, 'min_child_weight': 4, 'gamma': 0.5418371091725079, 'subsample': 0.6964528442045947, 'colsample_bytree': 0.7945786585066853, 'reg_alpha': 8.383487409926348, 'reg_lambda': 1.801227606365993}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:08,732] Trial 25 finished with value: 0.5407013773918152 and parameters: {'n_estimators': 360, 'learning_rate': 0.09383578193395664, 'max_depth': 3, 'min_child_weight': 4, 'gamma': 0.6247167867730725, 'subsample': 0.5759352383219941, 'colsample_bytree': 0.8604555455328667, 'reg_alpha': 7.284304617819455, 'reg_lambda': 3.08121700847077}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:27,239] Trial 26 finished with value: 0.5425023078918457 and parameters: {'n_estimators': 495, 'learning_rate': 0.03787779577950362, 'max_depth': 4, 'min_child_weight': 1, 'gamma': 0.491130039215511, 'subsample': 0.6594479465952411, 'colsample_bytree': 0.7750836297619202, 'reg_alpha': 8.680157456203698, 'reg_lambda': 5.495900414210266}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:30,868] Trial 27 finished with value: 0.53933265209198 and parameters: {'n_estimators': 335, 'learning_rate': 0.11736887581400667, 'max_depth': 3, 'min_child_weight': 5, 'gamma': 0.8187516904595493, 'subsample': 0.6099484443041393, 'colsample_bytree': 0.7510271924612915, 'reg_alpha': 6.49646428578942, 'reg_lambda': 0.7222028147330706}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:37,908] Trial 28 finished with value: 0.5147324204444885 and parameters: {'n_estimators': 195, 'learning_rate': 0.2434738648611366, 'max_depth': 5, 'min_child_weight': 3, 'gamma': 0.5672785585569524, 'subsample': 0.7457721548104954, 'colsample_bytree': 0.8683547328377962, 'reg_alpha': 7.924664906694303, 'reg_lambda': 4.253379164204221}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:44,959] Trial 29 finished with value: 0.5225183486938476 and parameters: {'n_estimators': 456, 'learning_rate': 0.04218092096635758, 'max_depth': 8, 'min_child_weight': 5, 'gamma': 0.4394513927468202, 'subsample': 0.7209905487524098, 'colsample_bytree': 0.7212782184436121, 'reg_alpha': 6.447263727827733, 'reg_lambda': 6.134843023455369}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:48,194] Trial 30 finished with value: 0.5387687802314758 and parameters: {'n_estimators': 335, 'learning_rate': 0.08359348295281316, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.6561493506066338, 'subsample': 0.5666550463672569, 'colsample_bytree': 0.7727076033000955, 'reg_alpha': 8.884417845650987, 'reg_lambda': 2.787253390415141}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:50,377] Trial 31 finished with value: 0.542467987537384 and parameters: {'n_estimators': 213, 'learning_rate': 0.12546995609031303, 'max_depth': 3, 'min_child_weight': 5, 'gamma': 0.6939132860420116, 'subsample': 0.7669770160461884, 'colsample_bytree': 0.7183955636632947, 'reg_alpha': 9.257910968739045, 'reg_lambda': 0.20156358306157268}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:28:55,319] Trial 32 finished with value: 0.5410897016525269 and parameters: {'n_estimators': 616, 'learning_rate': 0.05792957787817061, 'max_depth': 3, 'min_child_weight': 7, 'gamma': 0.5528689548586072, 'subsample': 0.6900205440050493, 'colsample_bytree': 0.6456551256638154, 'reg_alpha': 7.692909838687805, 'reg_lambda': 0.6841492551333954}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:29:02,763] Trial 33 finished with value: 0.45999300479888916 and parameters: {'n_estimators': 308, 'learning_rate': 0.10644845581984247, 'max_depth': 10, 'min_child_weight': 3, 'gamma': 0.8378298041242138, 'subsample': 0.7133136650658649, 'colsample_bytree': 0.7869546343704892, 'reg_alpha': 9.23133982282121, 'reg_lambda': 1.7464679446487787}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:29:05,964] Trial 34 finished with value: 0.5420541644096375 and parameters: {'n_estimators': 375, 'learning_rate': 0.07945091367909182, 'max_depth': 3, 'min_child_weight': 2, 'gamma': 0.025209625162848415, 'subsample': 0.7800664113430529, 'colsample_bytree': 0.8468592096780337, 'reg_alpha': 8.91771014584181, 'reg_lambda': 0.05031391603297647}. Best is trial 23 with value: 0.5443742871284485.
[I 2025-02-28 18:29:09,280] Trial 35 finished with value: 0.5453838586807251 and parameters: {'n_estimators': 432, 'learning_rate': 0.025260200065027993, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.7006081302005231, 'subsample': 0.6528657324270758, 'colsample_bytree': 0.7493578971046945, 'reg_alpha': 9.886845445607618, 'reg_lambda': 1.1707680158620883}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:12,614] Trial 36 finished with value: 0.5451103687286377 and parameters: {'n_estimators': 443, 'learning_rate': 0.02721077650020993, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.5700620659878068, 'subsample': 0.6460465525980261, 'colsample_bytree': 0.7552232657580044, 'reg_alpha': 8.443426859500827, 'reg_lambda': 8.766885625636503}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:16,873] Trial 37 finished with value: 0.545316469669342 and parameters: {'n_estimators': 445, 'learning_rate': 0.014051132538417561, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.9977188285355159, 'subsample': 0.665005073625462, 'colsample_bytree': 0.8194604822600398, 'reg_alpha': 3.3097653447085658, 'reg_lambda': 9.318801375913448}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:24,307] Trial 38 finished with value: 0.5432230710983277 and parameters: {'n_estimators': 616, 'learning_rate': 0.012728732176007318, 'max_depth': 6, 'min_child_weight': 3, 'gamma': 0.78526687548439, 'subsample': 0.6599673073106239, 'colsample_bytree': 0.9013964364712058, 'reg_alpha': 2.6688855103571254, 'reg_lambda': 9.683807813252347}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:29,578] Trial 39 finished with value: 0.5448433876037597 and parameters: {'n_estimators': 424, 'learning_rate': 0.03178572220544605, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 0.975911918061199, 'subsample': 0.6436148191741508, 'colsample_bytree': 0.8176649964963084, 'reg_alpha': 3.743006884808041, 'reg_lambda': 8.864779494401295}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:46,659] Trial 40 finished with value: 0.5420290708541871 and parameters: {'n_estimators': 438, 'learning_rate': 0.030064471131231757, 'max_depth': 5, 'min_child_weight': 1, 'gamma': 0.9915905148295727, 'subsample': 0.6442874248620128, 'colsample_bytree': 0.8226221366239665, 'reg_alpha': 3.4700168401844587, 'reg_lambda': 8.777796041462576}. Best is trial 35 with value: 0.5453838586807251.
[I 2025-02-28 18:29:51,130] Trial 41 finished with value: 0.5458519577980041 and parameters: {'n_estimators': 403, 'learning_rate': 0.025181939608234893, 'max_depth': 4, 'min_child_weight': 2, 'gamma': 0.9450617049891935, 'subsample': 0.6775014498470824, 'colsample_bytree': 0.7494749285293014, 'reg_alpha': 1.605804505030021, 'reg_lambda': 8.650350883129157}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:29:55,292] Trial 42 finished with value: 0.545633852481842 and parameters: {'n_estimators': 569, 'learning_rate': 0.023855296802388265, 'max_depth': 4, 'min_child_weight': 2, 'gamma': 0.9890646565353011, 'subsample': 0.6669536484875446, 'colsample_bytree': 0.7424051756638951, 'reg_alpha': 1.5819971923894427, 'reg_lambda': 8.381041733267152}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:01,203] Trial 43 finished with value: 0.5420405983924865 and parameters: {'n_estimators': 618, 'learning_rate': 0.01723386549535197, 'max_depth': 6, 'min_child_weight': 2, 'gamma': 0.8546734957885499, 'subsample': 0.668317819630224, 'colsample_bytree': 0.7437699757132172, 'reg_alpha': 1.329928302833131, 'reg_lambda': 7.7227282297985225}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:05,770] Trial 44 finished with value: 0.5408416152000427 and parameters: {'n_estimators': 534, 'learning_rate': 0.05016731777932866, 'max_depth': 4, 'min_child_weight': 2, 'gamma': 0.8835635709906183, 'subsample': 0.6280066678165979, 'colsample_bytree': 0.7088640020832621, 'reg_alpha': 1.5477380341377909, 'reg_lambda': 8.21794504715965}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:12,207] Trial 45 finished with value: 0.5454214453697205 and parameters: {'n_estimators': 763, 'learning_rate': 0.011788495314032659, 'max_depth': 5, 'min_child_weight': 3, 'gamma': 0.9478992875376774, 'subsample': 0.5795941144840977, 'colsample_bytree': 0.7409612230172122, 'reg_alpha': 0.482345262428165, 'reg_lambda': 6.716675509475262}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:18,688] Trial 46 finished with value: 0.5449263453483582 and parameters: {'n_estimators': 798, 'learning_rate': 0.012899785159010545, 'max_depth': 5, 'min_child_weight': 1, 'gamma': 0.9487509035940388, 'subsample': 0.5764018864860738, 'colsample_bytree': 0.7392979778285507, 'reg_alpha': 0.05638133810238344, 'reg_lambda': 6.6210598087435315}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:25,159] Trial 47 finished with value: 0.5420300960540771 and parameters: {'n_estimators': 732, 'learning_rate': 0.02075742254358072, 'max_depth': 5, 'min_child_weight': 3, 'gamma': 0.9425923194929283, 'subsample': 0.5504265237884219, 'colsample_bytree': 0.7056346692644623, 'reg_alpha': 1.9886329876252316, 'reg_lambda': 7.099955651679609}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:35,435] Trial 48 finished with value: 0.43148418664932253 and parameters: {'n_estimators': 949, 'learning_rate': 0.2744329489887136, 'max_depth': 7, 'min_child_weight': 2, 'gamma': 0.8837822072415009, 'subsample': 0.502146947802395, 'colsample_bytree': 0.8086645056135917, 'reg_alpha': 0.6606127328061231, 'reg_lambda': 9.96597973111648}. Best is trial 41 with value: 0.5458519577980041.
[I 2025-02-28 18:30:41,309] Trial 49 finished with value: 0.5215312957763671 and parameters: {'n_estimators': 802, 'learning_rate': 0.16891002755333676, 'max_depth': 4, 'min_child_weight': 3, 'gamma': 0.980728074563575, 'subsample': 0.5898947936728967, 'colsample_bytree': 0.6542891725895289, 'reg_alpha': 1.1067781204775877, 'reg_lambda': 7.9935035748966525}. Best is trial 41 with value: 0.5458519577980041.
Best trial: 41
Best R2 score: 0.5459
Best parameters: {'n_estimators': 403, 'learning_rate': 0.025181939608234893, 'max_depth': 4, 'min_child_weight': 2, 'gamma': 0.9450617049891935, 'subsample': 0.6775014498470824, 'colsample_bytree': 0.7494749285293014, 'reg_alpha': 1.605804505030021, 'reg_lambda': 8.650350883129157}
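The commented-out test-set evaluation in `tune_with_optuna` can be restored once the best parameters are known. A minimal sketch of that step, using scikit-learn's `GradientBoostingRegressor` on synthetic data (the `best_params` dictionary here holds illustrative stand-in values, not the Optuna output above, and `evaluate_model` is replaced by a plain `r2_score` call):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and target (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Stand-in for study.best_params (illustrative values)
best_params = {"n_estimators": 200, "learning_rate": 0.05, "max_depth": 3}

# Rebuild with the chosen parameters, refit, and score on the held-out split
best_model = GradientBoostingRegressor(**best_params, random_state=42).fit(X_tr, y_tr)
print(f"Test R2: {r2_score(y_te, best_model.predict(X_te)):.3f}")
```

Reporting the test-set R2 alongside the cross-validated score makes it easier to spot overfitting of the hyperparameter search to the training folds.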
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7494749285293014, device=None,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, feature_types=None, gamma=0.9450617049891935,
             grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.025181939608234893,
             max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=4, max_leaves=None,
             min_child_weight=2, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=403, n_jobs=None,
             num_parallel_tree=None, random_state=None, ...)
tune_with_optuna('lightgbm')
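The LightGBM run reuses the same 5-fold objective as the XGBoost one. That evaluation scheme can be sanity-checked in isolation on synthetic data; a sketch using only scikit-learn (variable names here are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Same evaluation scheme as the objective(): 5-fold CV on the R2 score
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=kf, scoring='r2')
print(f"Mean CV R2: {scores.mean():.3f}")
```

Because `shuffle=True` with a fixed `random_state`, every trial scores its candidate parameters on identical folds, so trial values are directly comparable.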
[I 2025-02-28 18:33:22,410] A new study created in memory with name: no-name-907c1fb3-8399-4bed-bbc6-3b54147e7028
Tuning lightgbm with Optuna...
[I 2025-02-28 18:33:27,720] Trial 0 finished with value: 0.5326662266336206 and parameters: {'n_estimators': 629, 'learning_rate': 0.19440969104676345, 'max_depth': 6, 'num_leaves': 75, 'min_child_samples': 85, 'subsample': 0.6848108451598178, 'colsample_bytree': 0.5040101283257208, 'reg_alpha': 1.494238109285655, 'reg_lambda': 2.8567483167100804}. Best is trial 0 with value: 0.5326662266336206.
[I 2025-02-28 18:33:28,178] Trial 1 finished with value: 0.5410214784890865 and parameters: {'n_estimators': 150, 'learning_rate': 0.14099869668804374, 'max_depth': 4, 'num_leaves': 79, 'min_child_samples': 36, 'subsample': 0.8130008038302748, 'colsample_bytree': 0.7693373361439555, 'reg_alpha': 6.305577356850068, 'reg_lambda': 4.232341972652249}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:31,569] Trial 2 finished with value: 0.5377573182053613 and parameters: {'n_estimators': 779, 'learning_rate': 0.07472664147668032, 'max_depth': 4, 'num_leaves': 155, 'min_child_samples': 42, 'subsample': 0.9709402817685133, 'colsample_bytree': 0.7252879942027699, 'reg_alpha': 8.54104157618054, 'reg_lambda': 6.190736802834839}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:35,420] Trial 3 finished with value: 0.5303618511898678 and parameters: {'n_estimators': 659, 'learning_rate': 0.11377300036644683, 'max_depth': 10, 'num_leaves': 90, 'min_child_samples': 98, 'subsample': 0.67708965639584, 'colsample_bytree': 0.6419376081350582, 'reg_alpha': 1.5040045119179457, 'reg_lambda': 6.837348145437019}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:36,115] Trial 4 finished with value: 0.5378743214204842 and parameters: {'n_estimators': 309, 'learning_rate': 0.17421406214155305, 'max_depth': 3, 'num_leaves': 113, 'min_child_samples': 34, 'subsample': 0.9738099891070344, 'colsample_bytree': 0.7562528414871743, 'reg_alpha': 5.133529252218217, 'reg_lambda': 2.11155294172321}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:39,233] Trial 5 finished with value: 0.5215350745032434 and parameters: {'n_estimators': 341, 'learning_rate': 0.18767238417574686, 'max_depth': 10, 'num_leaves': 136, 'min_child_samples': 66, 'subsample': 0.8867151507232669, 'colsample_bytree': 0.9092772966637461, 'reg_alpha': 6.138544864189051, 'reg_lambda': 6.242063310000051}. Best is trial 1 with value: 0.5410214784890865.
[I 2025-02-28 18:33:42,612] Trial 6 finished with value: 0.5417250030856776 and parameters: {'n_estimators': 502, 'learning_rate': 0.029940907936526262, 'max_depth': 5, 'num_leaves': 112, 'min_child_samples': 37, 'subsample': 0.8965774632044667, 'colsample_bytree': 0.706188547974648, 'reg_alpha': 9.858552429397744, 'reg_lambda': 6.876734019667184}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:33:47,924] Trial 7 finished with value: 0.5215763778434364 and parameters: {'n_estimators': 796, 'learning_rate': 0.12560450508861962, 'max_depth': 6, 'num_leaves': 42, 'min_child_samples': 21, 'subsample': 0.6066766343884689, 'colsample_bytree': 0.6659956073752201, 'reg_alpha': 5.889095924211835, 'reg_lambda': 5.0554002307174954}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:33:52,614] Trial 8 finished with value: 0.5092863005557293 and parameters: {'n_estimators': 977, 'learning_rate': 0.27120086458914267, 'max_depth': 8, 'num_leaves': 29, 'min_child_samples': 79, 'subsample': 0.9243835981489068, 'colsample_bytree': 0.7532166091964823, 'reg_alpha': 8.9896729330966, 'reg_lambda': 5.630390304010492}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:33:56,688] Trial 9 finished with value: 0.5025441274070831 and parameters: {'n_estimators': 728, 'learning_rate': 0.2917512797636476, 'max_depth': 9, 'num_leaves': 150, 'min_child_samples': 28, 'subsample': 0.6336649862655541, 'colsample_bytree': 0.5572172130464979, 'reg_alpha': 5.527068948583429, 'reg_lambda': 7.285684249055668}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:33:59,809] Trial 10 finished with value: 0.5397143085926579 and parameters: {'n_estimators': 394, 'learning_rate': 0.023293553559949526, 'max_depth': 7, 'num_leaves': 195, 'min_child_samples': 9, 'subsample': 0.7886885096146093, 'colsample_bytree': 0.8865561891282174, 'reg_alpha': 9.37212168472918, 'reg_lambda': 9.180762799566521}. Best is trial 6 with value: 0.5417250030856776.
[I 2025-02-28 18:34:00,543] Trial 11 finished with value: 0.5440497774562261 and parameters: {'n_estimators': 169, 'learning_rate': 0.0575100237763993, 'max_depth': 4, 'num_leaves': 65, 'min_child_samples': 53, 'subsample': 0.806500640439573, 'colsample_bytree': 0.8368286546246911, 'reg_alpha': 7.5362380941522025, 'reg_lambda': 3.7573596446273814}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:01,227] Trial 12 finished with value: 0.5246318104414452 and parameters: {'n_estimators': 104, 'learning_rate': 0.020000160544995246, 'max_depth': 5, 'num_leaves': 46, 'min_child_samples': 52, 'subsample': 0.514173455947583, 'colsample_bytree': 0.9982699830405768, 'reg_alpha': 7.371816306408743, 'reg_lambda': 0.29518524208796215}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:02,712] Trial 13 finished with value: 0.5411925234210939 and parameters: {'n_estimators': 463, 'learning_rate': 0.0764391252126574, 'max_depth': 3, 'num_leaves': 105, 'min_child_samples': 58, 'subsample': 0.8589779586148867, 'colsample_bytree': 0.8593960774380958, 'reg_alpha': 3.349327734980789, 'reg_lambda': 8.825646631511722}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:03,766] Trial 14 finished with value: 0.5404643214473739 and parameters: {'n_estimators': 232, 'learning_rate': 0.07265748179368517, 'max_depth': 5, 'num_leaves': 65, 'min_child_samples': 52, 'subsample': 0.7541433203124146, 'colsample_bytree': 0.8225882980665425, 'reg_alpha': 7.808907338965442, 'reg_lambda': 3.893214122646761}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:06,271] Trial 15 finished with value: 0.5414486987170964 and parameters: {'n_estimators': 500, 'learning_rate': 0.03921614899461118, 'max_depth': 5, 'num_leaves': 119, 'min_child_samples': 68, 'subsample': 0.8487149098935276, 'colsample_bytree': 0.6525639278829782, 'reg_alpha': 9.854851648190847, 'reg_lambda': 8.237653827910606}. Best is trial 11 with value: 0.5440497774562261.
[I 2025-02-28 18:34:07,070] Trial 16 finished with value: 0.5443086205520119 and parameters: {'n_estimators': 260, 'learning_rate': 0.05354527857411969, 'max_depth': 4, 'num_leaves': 181, 'min_child_samples': 14, 'subsample': 0.9034527304964098, 'colsample_bytree': 0.9994159970452017, 'reg_alpha': 7.570311497851004, 'reg_lambda': 1.3405041995581706}. Best is trial 16 with value: 0.5443086205520119.
[I 2025-02-28 18:34:07,759] Trial 17 finished with value: 0.5295824853867158 and parameters: {'n_estimators': 228, 'learning_rate': 0.22438058810292544, 'max_depth': 4, 'num_leaves': 170, 'min_child_samples': 12, 'subsample': 0.7304901298156259, 'colsample_bytree': 0.9983176321159564, 'reg_alpha': 3.5530664186589047, 'reg_lambda': 0.3217559236443166}. Best is trial 16 with value: 0.5443086205520119.
[I 2025-02-28 18:34:08,327] Trial 18 finished with value: 0.542659887557329 and parameters: {'n_estimators': 227, 'learning_rate': 0.1074330533270406, 'max_depth': 3, 'num_leaves': 200, 'min_child_samples': 5, 'subsample': 0.8127746284476145, 'colsample_bytree': 0.9166221827780916, 'reg_alpha': 7.493120614100051, 'reg_lambda': 1.6881099685242342}. Best is trial 16 with value: 0.5443086205520119.
[I 2025-02-28 18:34:10,478] Trial 19 finished with value: 0.5314217033148203 and parameters: {'n_estimators': 380, 'learning_rate': 0.08711656962220418, 'max_depth': 7, 'num_leaves': 171, 'min_child_samples': 46, 'subsample': 0.9970050654981224, 'colsample_bytree': 0.9489513574602209, 'reg_alpha': 3.9791562588878326, 'reg_lambda': 3.3760319121733873}. Best is trial 16 with value: 0.5443086205520119.
[I 2025-02-28 18:34:11,014] Trial 20 finished with value: 0.5450057029343796 and parameters: {'n_estimators': 160, 'learning_rate': 0.06003691234250976, 'max_depth': 4, 'num_leaves': 58, 'min_child_samples': 22, 'subsample': 0.9435407081247855, 'colsample_bytree': 0.817209088334033, 'reg_alpha': 0.008716293076002302, 'reg_lambda': 1.6787101201614192}. Best is trial 20 with value: 0.5450057029343796.
[I 2025-02-28 18:34:13,453] Trial 21 finished with value: 0.5445812148281128 and parameters: {'n_estimators': 178, 'learning_rate': 0.051386159440118845, 'max_depth': 4, 'num_leaves': 57, 'min_child_samples': 23, 'subsample': 0.9217250689320882, 'colsample_bytree': 0.820165779592411, 'reg_alpha': 0.12838244355625866, 'reg_lambda': 1.7191303037263683}. Best is trial 20 with value: 0.5450057029343796.
[I 2025-02-28 18:34:19,634] Trial 22 finished with value: 0.5425661132752796 and parameters: {'n_estimators': 290, 'learning_rate': 0.04511697178277499, 'max_depth': 3, 'num_leaves': 21, 'min_child_samples': 19, 'subsample': 0.9267869847901363, 'colsample_bytree': 0.7994204834613143, 'reg_alpha': 0.038883105269208976, 'reg_lambda': 1.3753824302594748}. Best is trial 20 with value: 0.5450057029343796.
[I 2025-02-28 18:34:20,065] Trial 23 finished with value: 0.5456215903300075 and parameters: {'n_estimators': 105, 'learning_rate': 0.08836732558896855, 'max_depth': 4, 'num_leaves': 45, 'min_child_samples': 21, 'subsample': 0.9333867867274527, 'colsample_bytree': 0.9562564045513104, 'reg_alpha': 0.4431230557718911, 'reg_lambda': 1.0592541089500838}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:20,890] Trial 24 finished with value: 0.5405766869803859 and parameters: {'n_estimators': 112, 'learning_rate': 0.08746086685679198, 'max_depth': 6, 'num_leaves': 46, 'min_child_samples': 27, 'subsample': 0.9520788371967963, 'colsample_bytree': 0.9440650557990548, 'reg_alpha': 0.046171908853950505, 'reg_lambda': 2.244366121149166}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:21,918] Trial 25 finished with value: 0.535510589956807 and parameters: {'n_estimators': 199, 'learning_rate': 0.14901860687126928, 'max_depth': 5, 'num_leaves': 61, 'min_child_samples': 25, 'subsample': 0.8674362826129792, 'colsample_bytree': 0.8644110156075968, 'reg_alpha': 1.3753040510819154, 'reg_lambda': 0.13728504436942557}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:23,554] Trial 26 finished with value: 0.49995194757989825 and parameters: {'n_estimators': 160, 'learning_rate': 0.011295858507771685, 'max_depth': 4, 'num_leaves': 33, 'min_child_samples': 19, 'subsample': 0.9299751072228464, 'colsample_bytree': 0.7995893745529109, 'reg_alpha': 0.8271453383923701, 'reg_lambda': 0.9031907867979468}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:24,618] Trial 27 finished with value: 0.54136394751594 and parameters: {'n_estimators': 323, 'learning_rate': 0.09626595448073069, 'max_depth': 3, 'num_leaves': 88, 'min_child_samples': 30, 'subsample': 0.9938516134243033, 'colsample_bytree': 0.7976554971033438, 'reg_alpha': 2.405818067427746, 'reg_lambda': 2.5967127499683675}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:26,927] Trial 28 finished with value: 0.5201181333296389 and parameters: {'n_estimators': 427, 'learning_rate': 0.13194627344054186, 'max_depth': 6, 'num_leaves': 54, 'min_child_samples': 19, 'subsample': 0.9519086905723467, 'colsample_bytree': 0.9601844464005312, 'reg_alpha': 2.4385442493846776, 'reg_lambda': 3.045706229465728}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:31,801] Trial 29 finished with value: 0.4915880380404 and parameters: {'n_estimators': 600, 'learning_rate': 0.2163074058331411, 'max_depth': 6, 'num_leaves': 77, 'min_child_samples': 5, 'subsample': 0.8404272741760088, 'colsample_bytree': 0.5791174303702202, 'reg_alpha': 0.763131541151066, 'reg_lambda': 0.9138907938127938}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:32,783] Trial 30 finished with value: 0.5410774429353168 and parameters: {'n_estimators': 103, 'learning_rate': 0.05767722540963362, 'max_depth': 7, 'num_leaves': 96, 'min_child_samples': 42, 'subsample': 0.7550155301064561, 'colsample_bytree': 0.8841437378831947, 'reg_alpha': 1.823241040317661, 'reg_lambda': 4.8248189525894345}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:33,878] Trial 31 finished with value: 0.5447656530021804 and parameters: {'n_estimators': 254, 'learning_rate': 0.05808400877680589, 'max_depth': 4, 'num_leaves': 127, 'min_child_samples': 13, 'subsample': 0.8942519817661881, 'colsample_bytree': 0.9252226926952034, 'reg_alpha': 0.7700580765795652, 'reg_lambda': 1.3743962462567594}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:37,584] Trial 32 finished with value: 0.5429212425552031 and parameters: {'n_estimators': 174, 'learning_rate': 0.11050905579021411, 'max_depth': 4, 'num_leaves': 124, 'min_child_samples': 11, 'subsample': 0.88453908363, 'colsample_bytree': 0.911772275522064, 'reg_alpha': 0.5370022210437813, 'reg_lambda': 1.8440603852294066}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:39,240] Trial 33 finished with value: 0.5436706983035731 and parameters: {'n_estimators': 269, 'learning_rate': 0.06645477750923878, 'max_depth': 4, 'num_leaves': 72, 'min_child_samples': 23, 'subsample': 0.9529348589427186, 'colsample_bytree': 0.8425531247283318, 'reg_alpha': 2.1041811447376135, 'reg_lambda': 2.6945853442668843}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:39,925] Trial 34 finished with value: 0.5438783037552988 and parameters: {'n_estimators': 153, 'learning_rate': 0.037608832076791034, 'max_depth': 5, 'num_leaves': 134, 'min_child_samples': 15, 'subsample': 0.9169017549885521, 'colsample_bytree': 0.7105740579595161, 'reg_alpha': 1.1707672136006777, 'reg_lambda': 0.9705638843869945}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:40,428] Trial 35 finished with value: 0.5427154958498155 and parameters: {'n_estimators': 205, 'learning_rate': 0.08515035737680028, 'max_depth': 3, 'num_leaves': 99, 'min_child_samples': 35, 'subsample': 0.976577307146639, 'colsample_bytree': 0.9573780085655689, 'reg_alpha': 0.3574671209336951, 'reg_lambda': 2.097426660996745}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:41,574] Trial 36 finished with value: 0.5387107169068805 and parameters: {'n_estimators': 371, 'learning_rate': 0.10110259261973088, 'max_depth': 4, 'num_leaves': 88, 'min_child_samples': 30, 'subsample': 0.8374301871968041, 'colsample_bytree': 0.7818658554621556, 'reg_alpha': 1.0959483138214856, 'reg_lambda': 0.5995854012783639}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:43,202] Trial 37 finished with value: 0.5396968012491913 and parameters: {'n_estimators': 138, 'learning_rate': 0.12088174637805628, 'max_depth': 5, 'num_leaves': 56, 'min_child_samples': 41, 'subsample': 0.8796684297084698, 'colsample_bytree': 0.8863794724274013, 'reg_alpha': 2.9878992702263107, 'reg_lambda': 1.4572644846649254}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:43,869] Trial 38 finished with value: 0.5393594910745998 and parameters: {'n_estimators': 299, 'learning_rate': 0.17243087009432634, 'max_depth': 3, 'num_leaves': 39, 'min_child_samples': 16, 'subsample': 0.9594673375870353, 'colsample_bytree': 0.7432995022921125, 'reg_alpha': 4.1951285950019335, 'reg_lambda': 4.526524021636865}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:44,343] Trial 39 finished with value: 0.541503334093693 and parameters: {'n_estimators': 195, 'learning_rate': 0.04820279391746983, 'max_depth': 3, 'num_leaves': 23, 'min_child_samples': 34, 'subsample': 0.9089925422230793, 'colsample_bytree': 0.9157543526042334, 'reg_alpha': 1.6204116800140465, 'reg_lambda': 9.999165404425318}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:47,204] Trial 40 finished with value: 0.5402542217710835 and parameters: {'n_estimators': 891, 'learning_rate': 0.06386722099224824, 'max_depth': 4, 'num_leaves': 152, 'min_child_samples': 86, 'subsample': 0.9414246587526459, 'colsample_bytree': 0.8296867708818856, 'reg_alpha': 0.5138705769701981, 'reg_lambda': 2.9784502461790403}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:48,729] Trial 41 finished with value: 0.5445184518016682 and parameters: {'n_estimators': 265, 'learning_rate': 0.051112577038867596, 'max_depth': 4, 'num_leaves': 184, 'min_child_samples': 13, 'subsample': 0.8960916492228412, 'colsample_bytree': 0.9964304875925206, 'reg_alpha': 4.6547945352067925, 'reg_lambda': 1.0359297664635108}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:49,912] Trial 42 finished with value: 0.5442188536569572 and parameters: {'n_estimators': 258, 'learning_rate': 0.026715282361536048, 'max_depth': 4, 'num_leaves': 80, 'min_child_samples': 9, 'subsample': 0.8958234486894006, 'colsample_bytree': 0.9748527037126143, 'reg_alpha': 6.4933232426704475, 'reg_lambda': 0.037582154011755575}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:52,167] Trial 43 finished with value: 0.5427754113678465 and parameters: {'n_estimators': 340, 'learning_rate': 0.034722111409205815, 'max_depth': 5, 'num_leaves': 143, 'min_child_samples': 23, 'subsample': 0.9765221722257463, 'colsample_bytree': 0.9287061540163946, 'reg_alpha': 0.06984775689116524, 'reg_lambda': 2.399014779413541}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:53,066] Trial 44 finished with value: 0.49648814294248966 and parameters: {'n_estimators': 145, 'learning_rate': 0.01039799077016123, 'max_depth': 4, 'num_leaves': 51, 'min_child_samples': 17, 'subsample': 0.8667329383623265, 'colsample_bytree': 0.972684707238554, 'reg_alpha': 4.667484362671332, 'reg_lambda': 1.1404983522664032}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:54,907] Trial 45 finished with value: 0.5310781320211311 and parameters: {'n_estimators': 249, 'learning_rate': 0.07799566079346314, 'max_depth': 9, 'num_leaves': 35, 'min_child_samples': 7, 'subsample': 0.776795657756453, 'colsample_bytree': 0.8664037375368475, 'reg_alpha': 0.9866982748253421, 'reg_lambda': 1.9054159115405902}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:55,763] Trial 46 finished with value: 0.5444301348331371 and parameters: {'n_estimators': 191, 'learning_rate': 0.049908023287768236, 'max_depth': 4, 'num_leaves': 70, 'min_child_samples': 31, 'subsample': 0.7064742151056554, 'colsample_bytree': 0.9325350699218569, 'reg_alpha': 2.8681911029061515, 'reg_lambda': 0.6917947083432978}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:34:56,244] Trial 47 finished with value: 0.5389823110294107 and parameters: {'n_estimators': 100, 'learning_rate': 0.06892935443455488, 'max_depth': 3, 'num_leaves': 182, 'min_child_samples': 24, 'subsample': 0.8287928806960181, 'colsample_bytree': 0.6905499586878077, 'reg_alpha': 1.5501157097420872, 'reg_lambda': 1.6076542318068407}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:35:03,267] Trial 48 finished with value: 0.5292613035876131 and parameters: {'n_estimators': 659, 'learning_rate': 0.09668139637933358, 'max_depth': 5, 'num_leaves': 129, 'min_child_samples': 38, 'subsample': 0.6361764555366813, 'colsample_bytree': 0.9699246977527767, 'reg_alpha': 1.9624990748351678, 'reg_lambda': 3.4702833737538317}. Best is trial 23 with value: 0.5456215903300075.
[I 2025-02-28 18:35:05,061] Trial 49 finished with value: 0.525258471414587 and parameters: {'n_estimators': 298, 'learning_rate': 0.1327156660852742, 'max_depth': 6, 'num_leaves': 158, 'min_child_samples': 12, 'subsample': 0.9313406143953251, 'colsample_bytree': 0.8903897279513846, 'reg_alpha': 6.816620129550619, 'reg_lambda': 5.469460667657804}. Best is trial 23 with value: 0.5456215903300075.
Best trial: 23
Best R2 score: 0.5456
Best parameters: {'n_estimators': 105, 'learning_rate': 0.08836732558896855, 'max_depth': 4, 'num_leaves': 45, 'min_child_samples': 21, 'subsample': 0.9333867867274527, 'colsample_bytree': 0.9562564045513104, 'reg_alpha': 0.4431230557718911, 'reg_lambda': 1.0592541089500838}
LGBMRegressor(colsample_bytree=0.9562564045513104,
learning_rate=0.08836732558896855, max_depth=4,
min_child_samples=21, n_estimators=105, num_leaves=45,
reg_alpha=0.4431230557718911, reg_lambda=1.0592541089500838,
              subsample=0.9333867867274527)
tune_with_optuna('gbr')
[I 2025-02-28 18:37:07,435] A new study created in memory with name: no-name-bff5797b-c4c7-420e-8423-0eb272ac4979
Tuning gbr with Optuna...
[I 2025-02-28 18:37:17,518] Trial 0 finished with value: 0.5166258331501118 and parameters: {'n_estimators': 371, 'learning_rate': 0.23149802485829724, 'max_depth': 5, 'min_samples_split': 8, 'min_samples_leaf': 6, 'subsample': 0.8654801770757944, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:37:31,525] Trial 1 finished with value: 0.5011830303200927 and parameters: {'n_estimators': 597, 'learning_rate': 0.16297507958502855, 'max_depth': 6, 'min_samples_split': 16, 'min_samples_leaf': 5, 'subsample': 0.9385872347320672, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:37:49,758] Trial 2 finished with value: 0.46754387432227185 and parameters: {'n_estimators': 581, 'learning_rate': 0.13266622970545963, 'max_depth': 8, 'min_samples_split': 15, 'min_samples_leaf': 3, 'subsample': 0.8769286659977769, 'max_features': 'log2'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:37:59,836] Trial 3 finished with value: 0.48447491534964604 and parameters: {'n_estimators': 258, 'learning_rate': 0.21758901231806224, 'max_depth': 9, 'min_samples_split': 17, 'min_samples_leaf': 6, 'subsample': 0.6567767542664944, 'max_features': 'log2'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:38:33,678] Trial 4 finished with value: 0.48657479842045426 and parameters: {'n_estimators': 694, 'learning_rate': 0.14420086005981206, 'max_depth': 7, 'min_samples_split': 11, 'min_samples_leaf': 6, 'subsample': 0.6442223861292229, 'max_features': 'log2'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:38:51,121] Trial 5 finished with value: 0.513709258108584 and parameters: {'n_estimators': 729, 'learning_rate': 0.1880798784796406, 'max_depth': 5, 'min_samples_split': 3, 'min_samples_leaf': 10, 'subsample': 0.7923205103911215, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:39:05,886] Trial 6 finished with value: 0.39744537459852963 and parameters: {'n_estimators': 482, 'learning_rate': 0.2874701249273012, 'max_depth': 10, 'min_samples_split': 9, 'min_samples_leaf': 4, 'subsample': 0.8650893517864041, 'max_features': 'log2'}. Best is trial 0 with value: 0.5166258331501118.
[I 2025-02-28 18:39:09,982] Trial 7 finished with value: 0.5275963120649791 and parameters: {'n_estimators': 173, 'learning_rate': 0.15440197329999186, 'max_depth': 6, 'min_samples_split': 15, 'min_samples_leaf': 2, 'subsample': 0.9524441160270005, 'max_features': 'sqrt'}. Best is trial 7 with value: 0.5275963120649791.
[I 2025-02-28 18:39:30,982] Trial 8 finished with value: 0.4447553338956525 and parameters: {'n_estimators': 772, 'learning_rate': 0.14481270830876722, 'max_depth': 9, 'min_samples_split': 2, 'min_samples_leaf': 5, 'subsample': 0.8492243714089422, 'max_features': 'sqrt'}. Best is trial 7 with value: 0.5275963120649791.
[I 2025-02-28 18:39:50,694] Trial 9 finished with value: 0.5009965748591971 and parameters: {'n_estimators': 904, 'learning_rate': 0.09770195111574961, 'max_depth': 6, 'min_samples_split': 2, 'min_samples_leaf': 4, 'subsample': 0.845242795990188, 'max_features': 'sqrt'}. Best is trial 7 with value: 0.5275963120649791.
[I 2025-02-28 18:39:54,616] Trial 10 finished with value: 0.4856570793039988 and parameters: {'n_estimators': 100, 'learning_rate': 0.0158195981360717, 'max_depth': 3, 'min_samples_split': 20, 'min_samples_leaf': 1, 'subsample': 0.5534182536566528, 'max_features': None}. Best is trial 7 with value: 0.5275963120649791.
[I 2025-02-28 18:40:00,484] Trial 11 finished with value: 0.5322821113025176 and parameters: {'n_estimators': 249, 'learning_rate': 0.24880540941081658, 'max_depth': 4, 'min_samples_split': 7, 'min_samples_leaf': 8, 'subsample': 0.9793200911044663, 'max_features': 'sqrt'}. Best is trial 11 with value: 0.5322821113025176.
[I 2025-02-28 18:40:06,964] Trial 12 finished with value: 0.5341450618057997 and parameters: {'n_estimators': 143, 'learning_rate': 0.2841875556141446, 'max_depth': 3, 'min_samples_split': 6, 'min_samples_leaf': 9, 'subsample': 0.9978996570520571, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:21,958] Trial 13 finished with value: 0.5191535779959852 and parameters: {'n_estimators': 313, 'learning_rate': 0.2993437119857678, 'max_depth': 3, 'min_samples_split': 6, 'min_samples_leaf': 9, 'subsample': 0.9944523744088988, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:37,685] Trial 14 finished with value: 0.5056679911090216 and parameters: {'n_estimators': 418, 'learning_rate': 0.2471412387561913, 'max_depth': 4, 'min_samples_split': 6, 'min_samples_leaf': 8, 'subsample': 0.9982987291810923, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:45,036] Trial 15 finished with value: 0.5092445560008925 and parameters: {'n_estimators': 245, 'learning_rate': 0.2639713069965241, 'max_depth': 4, 'min_samples_split': 5, 'min_samples_leaf': 8, 'subsample': 0.7141662972999565, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:52,710] Trial 16 finished with value: 0.5267751128450314 and parameters: {'n_estimators': 184, 'learning_rate': 0.2084217747264252, 'max_depth': 4, 'min_samples_split': 11, 'min_samples_leaf': 8, 'subsample': 0.9252964221661318, 'max_features': None}. Best is trial 12 with value: 0.5341450618057997.
[I 2025-02-28 18:40:56,885] Trial 17 finished with value: 0.5376825710642157 and parameters: {'n_estimators': 106, 'learning_rate': 0.2671548773506419, 'max_depth': 3, 'min_samples_split': 8, 'min_samples_leaf': 10, 'subsample': 0.7880943235243648, 'max_features': None}. Best is trial 17 with value: 0.5376825710642157.
[I 2025-02-28 18:41:00,748] Trial 18 finished with value: 0.5416257967246378 and parameters: {'n_estimators': 123, 'learning_rate': 0.09423480329576486, 'max_depth': 3, 'min_samples_split': 9, 'min_samples_leaf': 10, 'subsample': 0.512923259349313, 'max_features': None}. Best is trial 18 with value: 0.5416257967246378.
[I 2025-02-28 18:41:13,212] Trial 19 finished with value: 0.5208678147341531 and parameters: {'n_estimators': 462, 'learning_rate': 0.07019810766489495, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 10, 'subsample': 0.5168346995942698, 'max_features': None}. Best is trial 18 with value: 0.5416257967246378.
[I 2025-02-28 18:41:22,515] Trial 20 finished with value: 0.5416957030088807 and parameters: {'n_estimators': 361, 'learning_rate': 0.07369596953659807, 'max_depth': 3, 'min_samples_split': 13, 'min_samples_leaf': 10, 'subsample': 0.7440250200014602, 'max_features': None}. Best is trial 20 with value: 0.5416957030088807.
[I 2025-02-28 18:41:33,415] Trial 21 finished with value: 0.5426331486901319 and parameters: {'n_estimators': 338, 'learning_rate': 0.06477196411506755, 'max_depth': 3, 'min_samples_split': 13, 'min_samples_leaf': 10, 'subsample': 0.7787886752931414, 'max_features': None}. Best is trial 21 with value: 0.5426331486901319.
[I 2025-02-28 18:41:44,917] Trial 22 finished with value: 0.5428936591876395 and parameters: {'n_estimators': 347, 'learning_rate': 0.06064214039077384, 'max_depth': 3, 'min_samples_split': 12, 'min_samples_leaf': 9, 'subsample': 0.7127842938125919, 'max_features': None}. Best is trial 22 with value: 0.5428936591876395.
[I 2025-02-28 18:42:03,260] Trial 23 finished with value: 0.543469553667053 and parameters: {'n_estimators': 353, 'learning_rate': 0.039199470874611626, 'max_depth': 4, 'min_samples_split': 14, 'min_samples_leaf': 9, 'subsample': 0.739062694687006, 'max_features': None}. Best is trial 23 with value: 0.543469553667053.
[I 2025-02-28 18:42:21,265] Trial 24 finished with value: 0.5457570109135942 and parameters: {'n_estimators': 532, 'learning_rate': 0.013201142547601463, 'max_depth': 4, 'min_samples_split': 13, 'min_samples_leaf': 7, 'subsample': 0.6777821390980873, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:42:42,385] Trial 25 finished with value: 0.5439261216932461 and parameters: {'n_estimators': 532, 'learning_rate': 0.010039986736273668, 'max_depth': 5, 'min_samples_split': 13, 'min_samples_leaf': 7, 'subsample': 0.6498643321838303, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:43:01,640] Trial 26 finished with value: 0.5436549026412915 and parameters: {'n_estimators': 527, 'learning_rate': 0.012657542037501569, 'max_depth': 5, 'min_samples_split': 18, 'min_samples_leaf': 7, 'subsample': 0.632831158101171, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:43:39,045] Trial 27 finished with value: 0.5436108167392069 and parameters: {'n_estimators': 648, 'learning_rate': 0.010479491492067202, 'max_depth': 5, 'min_samples_split': 19, 'min_samples_leaf': 7, 'subsample': 0.6263160862609665, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:44:17,342] Trial 28 finished with value: 0.5188927330135028 and parameters: {'n_estimators': 526, 'learning_rate': 0.03144435154272926, 'max_depth': 7, 'min_samples_split': 18, 'min_samples_leaf': 7, 'subsample': 0.6064182077680091, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:44:50,139] Trial 29 finished with value: 0.5318818151914331 and parameters: {'n_estimators': 528, 'learning_rate': 0.037078760979773806, 'max_depth': 5, 'min_samples_split': 17, 'min_samples_leaf': 7, 'subsample': 0.6793627926448832, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:45:17,226] Trial 30 finished with value: 0.5200594442228811 and parameters: {'n_estimators': 828, 'learning_rate': 0.10160343797110957, 'max_depth': 5, 'min_samples_split': 14, 'min_samples_leaf': 6, 'subsample': 0.5913224517175929, 'max_features': 'log2'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:45:46,037] Trial 31 finished with value: 0.5437248279671836 and parameters: {'n_estimators': 642, 'learning_rate': 0.010718707706394888, 'max_depth': 5, 'min_samples_split': 20, 'min_samples_leaf': 7, 'subsample': 0.6192044818969257, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:46:20,067] Trial 32 finished with value: 0.5279056342257883 and parameters: {'n_estimators': 648, 'learning_rate': 0.024977632144370873, 'max_depth': 6, 'min_samples_split': 20, 'min_samples_leaf': 7, 'subsample': 0.5747001517631624, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:46:40,594] Trial 33 finished with value: 0.521898238645211 and parameters: {'n_estimators': 562, 'learning_rate': 0.053136898230836574, 'max_depth': 5, 'min_samples_split': 18, 'min_samples_leaf': 5, 'subsample': 0.6810347705111421, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:47:22,430] Trial 34 finished with value: 0.5012564856508492 and parameters: {'n_estimators': 616, 'learning_rate': 0.044852191212167415, 'max_depth': 7, 'min_samples_split': 16, 'min_samples_leaf': 6, 'subsample': 0.5526377301388273, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:47:45,550] Trial 35 finished with value: 0.5404149482117955 and parameters: {'n_estimators': 480, 'learning_rate': 0.012788846087546015, 'max_depth': 6, 'min_samples_split': 16, 'min_samples_leaf': 7, 'subsample': 0.6614972336753118, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:47:52,161] Trial 36 finished with value: 0.5356853105269188 and parameters: {'n_estimators': 421, 'learning_rate': 0.11546072593103919, 'max_depth': 4, 'min_samples_split': 19, 'min_samples_leaf': 5, 'subsample': 0.6306622817556439, 'max_features': 'log2'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:48:27,370] Trial 37 finished with value: 0.5238012786375147 and parameters: {'n_estimators': 703, 'learning_rate': 0.027445701082624965, 'max_depth': 6, 'min_samples_split': 17, 'min_samples_leaf': 6, 'subsample': 0.7075722781366109, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:48:54,141] Trial 38 finished with value: 0.4803893172806074 and parameters: {'n_estimators': 585, 'learning_rate': 0.08241134882850128, 'max_depth': 7, 'min_samples_split': 15, 'min_samples_leaf': 7, 'subsample': 0.6158320947239292, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:49:12,312] Trial 39 finished with value: 0.5191469540618101 and parameters: {'n_estimators': 765, 'learning_rate': 0.04867681569444588, 'max_depth': 8, 'min_samples_split': 19, 'min_samples_leaf': 8, 'subsample': 0.6556495911466975, 'max_features': 'log2'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:49:27,121] Trial 40 finished with value: 0.51445861942503 and parameters: {'n_estimators': 996, 'learning_rate': 0.17604560515527518, 'max_depth': 4, 'min_samples_split': 12, 'min_samples_leaf': 5, 'subsample': 0.6865704822754182, 'max_features': 'sqrt'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:49:49,832] Trial 41 finished with value: 0.5428321400737083 and parameters: {'n_estimators': 656, 'learning_rate': 0.012314841065014846, 'max_depth': 5, 'min_samples_split': 19, 'min_samples_leaf': 7, 'subsample': 0.6321577435982716, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:50:16,816] Trial 42 finished with value: 0.5429624565039538 and parameters: {'n_estimators': 630, 'learning_rate': 0.012421853867381209, 'max_depth': 5, 'min_samples_split': 20, 'min_samples_leaf': 6, 'subsample': 0.5769075212947131, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:50:40,490] Trial 43 finished with value: 0.5305765917532641 and parameters: {'n_estimators': 532, 'learning_rate': 0.025803873387967642, 'max_depth': 6, 'min_samples_split': 18, 'min_samples_leaf': 7, 'subsample': 0.6127969490955832, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:51:07,869] Trial 44 finished with value: 0.5169077552483072 and parameters: {'n_estimators': 664, 'learning_rate': 0.05177198947130677, 'max_depth': 5, 'min_samples_split': 16, 'min_samples_leaf': 4, 'subsample': 0.6490376456361933, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:51:17,510] Trial 45 finished with value: 0.5335218764322195 and parameters: {'n_estimators': 439, 'learning_rate': 0.010319344258733786, 'max_depth': 4, 'min_samples_split': 19, 'min_samples_leaf': 6, 'subsample': 0.5415114067883431, 'max_features': 'sqrt'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:51:36,072] Trial 46 finished with value: 0.5419270218451692 and parameters: {'n_estimators': 576, 'learning_rate': 0.028425282082328862, 'max_depth': 5, 'min_samples_split': 14, 'min_samples_leaf': 8, 'subsample': 0.5953717473890078, 'max_features': 'log2'}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:52:16,879] Trial 47 finished with value: 0.5161883235424825 and parameters: {'n_estimators': 720, 'learning_rate': 0.037834002926623006, 'max_depth': 6, 'min_samples_split': 17, 'min_samples_leaf': 8, 'subsample': 0.6315956912901397, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:52:42,653] Trial 48 finished with value: 0.5399822816152737 and parameters: {'n_estimators': 515, 'learning_rate': 0.021419953513921924, 'max_depth': 5, 'min_samples_split': 20, 'min_samples_leaf': 6, 'subsample': 0.6687823840080587, 'max_features': None}. Best is trial 24 with value: 0.5457570109135942.
[I 2025-02-28 18:53:18,400] Trial 49 finished with value: 0.4661191543056004 and parameters: {'n_estimators': 750, 'learning_rate': 0.12102882067315053, 'max_depth': 10, 'min_samples_split': 18, 'min_samples_leaf': 7, 'subsample': 0.704352712615413, 'max_features': 'sqrt'}. Best is trial 24 with value: 0.5457570109135942.
Best trial: 24
Best R2 score: 0.5458
Best parameters: {'n_estimators': 532, 'learning_rate': 0.013201142547601463, 'max_depth': 4, 'min_samples_split': 13, 'min_samples_leaf': 7, 'subsample': 0.6777821390980873, 'max_features': None}
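Given the best parameters reported above, the tuned estimator can be rebuilt and refit directly. The sketch below uses a synthetic dataset and a hard-coded parameter dict as stand-ins for the notebook's `X_train`/`X_test` splits and `study.best_params`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features; in the notebook these would be
# the existing train/test splits and study.best_params from the run above.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Best parameters from the Optuna run above
best_params = {
    "n_estimators": 532,
    "learning_rate": 0.013201142547601463,
    "max_depth": 4,
    "min_samples_split": 13,
    "min_samples_leaf": 7,
    "subsample": 0.6777821390980873,
    "max_features": None,
}
best_gbr = GradientBoostingRegressor(**best_params, random_state=42)
best_gbr.fit(X_train, y_train)
score = best_gbr.score(X_test, y_test)  # R2 on the held-out split
```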
GradientBoostingRegressor(learning_rate=0.013201142547601463, max_depth=4,
min_samples_leaf=7, min_samples_split=13,
                          n_estimators=532, subsample=0.6777821390980873)
GradientBoostingRegressor(min_samples_leaf=4, min_samples_split=10,
                          random_state=42, subsample=0.8)
XGBRegressor(colsample_bytree=1.0, gamma=0.2, learning_rate=0.05, max_depth=3,
             min_child_weight=3, n_estimators=500, random_state=42)
LGBMRegressor(colsample_bytree=0.6, learning_rate=0.05, max_depth=3,
              min_child_samples=5, n_estimators=500, num_leaves=50,
              random_state=42, reg_alpha=10, reg_lambda=1, subsample=0.6,
              verbose=-1)
Optuna-tuned models:
XGBRegressor(colsample_bytree=0.7494749285293014, gamma=0.9450617049891935,
             learning_rate=0.025181939608234893, max_depth=4,
             min_child_weight=2, n_estimators=403)
LGBMRegressor(colsample_bytree=0.9562564045513104,
              learning_rate=0.08836732558896855, max_depth=4,
              min_child_samples=21, n_estimators=105, num_leaves=45,
              reg_alpha=0.4431230557718911, reg_lambda=1.0592541089500838,
              subsample=0.9333867867274527)
GradientBoostingRegressor(learning_rate=0.013201142547601463, max_depth=4,
                          min_samples_leaf=7, min_samples_split=13,
                          n_estimators=532, subsample=0.6777821390980873)
# Define a dictionary of regression models (manually tuned and Optuna-tuned variants);
# parameters left at their defaults are omitted for readability
regression_models_3 = {
    "GradientBoosting_Tuned_2": GradientBoostingRegressor(
        min_samples_leaf=4, min_samples_split=10, random_state=42, subsample=0.8),
    "XGBoost_Tuned_2": xgb.XGBRegressor(
        colsample_bytree=1.0, gamma=0.2, learning_rate=0.05, max_depth=3,
        min_child_weight=3, n_estimators=500, random_state=42),
    "LightGBM_Tuned_2": lgb.LGBMRegressor(
        colsample_bytree=0.6, learning_rate=0.05, max_depth=3, min_child_samples=5,
        n_estimators=500, num_leaves=50, random_state=42, reg_alpha=10,
        reg_lambda=1, subsample=0.6, verbose=-1),
    "GradientBoosting_Optima": GradientBoostingRegressor(
        learning_rate=0.013201142547601463, max_depth=4, min_samples_leaf=7,
        min_samples_split=13, n_estimators=532, subsample=0.6777821390980873),
    "XGBoost_Tuned_Optima": xgb.XGBRegressor(
        colsample_bytree=0.7494749285293014, gamma=0.9450617049891935,
        learning_rate=0.025181939608234893, max_depth=4, min_child_weight=2,
        n_estimators=403),
    "LightGBM_Tuned_Optima": lgb.LGBMRegressor(
        colsample_bytree=0.9562564045513104, learning_rate=0.08836732558896855,
        max_depth=4, min_child_samples=21, n_estimators=105, num_leaves=45,
        reg_alpha=0.4431230557718911, reg_lambda=1.0592541089500838,
        subsample=0.9333867867274527, verbose=-1)
}
# Initialize an empty DataFrame to store results
results_df_3 = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
%%time
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models_3.items():
model.fit(X_train, y_train)
metrics = evaluate_model(model, X_test, y_test)
metrics["Model"] = model_name # Add model name for reference
results_df_3 = pd.concat([results_df_3, pd.DataFrame([metrics])], ignore_index=True)
CPU times: total: 9.75 s Wall time: 10.6 s
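The `evaluate_model` helper called in the loop above is defined earlier in the notebook. For reference, a minimal sketch consistent with the metric columns reported below might look like this (an assumption, not the notebook's exact code):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model_sketch(model, X_test, y_test):
    """Return MAE, MSE, RMSE, and R2 for an already-fitted regressor."""
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # same units as the target variable
        "R2 Score": r2_score(y_test, y_pred),
    }
```

The returned dictionary matches the columns of `results_df_3`, so each model's row can be appended directly.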
# Display the results DataFrame
# Tuned Models
results_df_3.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 4 | XGBoost_Tuned_Optima | 190.967571 | 62720.926919 | 250.441464 | 0.551006 |
| 3 | GradientBoosting_Optima | 190.906116 | 62859.987800 | 250.718942 | 0.550011 |
| 0 | GradientBoosting_Tuned_2 | 191.385545 | 62994.960983 | 250.987970 | 0.549045 |
| 5 | LightGBM_Tuned_Optima | 191.011681 | 63051.358720 | 251.100296 | 0.548641 |
| 1 | XGBoost_Tuned_2 | 191.250406 | 63130.225737 | 251.257290 | 0.548076 |
| 2 | LightGBM_Tuned_2 | 191.907873 | 63205.786577 | 251.407610 | 0.547535 |
# Display the results DataFrame
# Advanced Models
results_df_2.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 2 | GradientBoosting_Tuned_1 | 190.979607 | 63554.668646 | 252.100513 | 0.545038 |
| 4 | LightGBM_Tuned_1 | 191.308692 | 63653.691737 | 252.296833 | 0.544329 |
| 3 | XGBoost_Tuned_1 | 191.172298 | 63903.217757 | 252.790858 | 0.542543 |
| 1 | RandomForest_Tuned_1 | 192.255939 | 64194.833169 | 253.366993 | 0.540455 |
| 5 | NeuralNetwork(MLP) | 201.536209 | 66683.063761 | 258.230641 | 0.522643 |
| 0 | DecisionTree_Tuned_1 | 197.879923 | 70119.813941 | 264.801461 | 0.498041 |
# Display the results DataFrame
# Regression Models
results_df.sort_values(by="R2 Score", ascending=False)
| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 0 | Linear Regression | 203.946460 | 68259.489536 | 261.265171 | 0.511358 |
| 2 | Ridge Regression | 203.959557 | 68260.799293 | 261.267677 | 0.511349 |
| 1 | Lasso Regression | 205.169206 | 68709.866263 | 262.125669 | 0.508134 |
| 4 | Random Forest | 203.172925 | 74554.912118 | 273.047454 | 0.466292 |
| 5 | K-Nearest Neighbors | 204.794534 | 74634.435609 | 273.193037 | 0.465722 |
| 6 | Support Vector Regressor | 216.052970 | 89802.139881 | 299.670052 | 0.357143 |
| 3 | Decision Tree | 234.987992 | 106660.526170 | 326.589232 | 0.236460 |
- After model tuning, another slight improvement was achieved.
- The best R² Score among the tuned models is currently 0.5510, with the XGBoost_Tuned_Optima model.
- The best R² Score achieved is still low, suggesting the models are not explaining a significant portion of the variance in the target variable.
Revisit¶
- So far, various regression models were tested (Linear, Lasso, Ridge, Decision Tree, Random Forest, KNN).
- Feature selection and transformation steps were applied.
- Hyperparameter tuning was performed for some models (Random Forest, Gradient Boosting, XGBoost, LightGBM).
- Optuna was used for advanced hyperparameter tuning.
- Despite all this, the R² score remained low.
- In this section, a different approach will be tested, aiming for better results.
# Outlier handling
df7=df4[(np.abs(df4.select_dtypes(include=np.number).apply(zscore))<3).all(axis=1)] #drop over 3 standard deviations
count_outliers(df7)
price: 1345 outliers (8.67%)
rooms: 236 outliers (1.52%)
bathroom: 0 outliers (0.00%)
square_meters: 899 outliers (5.80%)
square_meters_price: 897 outliers (5.78%)
3377
df7.shape
(15511, 9)
- In this new approach, not all outliers will be removed.
- Only values more than 3 standard deviations from the mean will be removed.
- The remaining data will be treated as valid modeling information, since it captures high-end areas or luxury units.
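The `count_outliers` helper used above is defined earlier in the notebook. Since outliers remain after the 3-standard-deviation filter, it presumably applies a different criterion; a minimal IQR-based sketch (an assumption, not the notebook's exact code) might be:

```python
import numpy as np
import pandas as pd

def count_outliers_sketch(df: pd.DataFrame) -> int:
    """Count 1.5*IQR outliers per numeric column; return the grand total."""
    total = 0
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        n = int(mask.sum())
        print(f"{col}: {n} outliers ({n / len(df) * 100:.2f}%)")
        total += n
    return total
```

Because the 1.5·IQR fences are tighter than ±3 standard deviations on skewed data, this explains why the filtered data still reports outliers.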
df7.to_csv('df7_WITHOUT OUTLIERS 3SD_DATA.csv', index=False) # Save a copy of data after outliers handling new approach
# Load data
data = df7.copy()
# Create dummy variables for categorical features with specified baseline categories
data = pd.get_dummies(data, columns=['real_state', 'neighborhood'], drop_first=False)
for feature, baseline in {'real_state': "flat", 'neighborhood': "Eixample"}.items():
if f"{feature}_{baseline}" in data.columns:
data.drop(columns=[f"{feature}_{baseline}"], inplace=True)
- Feature selection done considering "real_state_flat" and "neighborhood_Eixample" as the baseline categories for one-hot encoding
# Convert boolean columns to numeric (0 and 1)
bool_cols = data.select_dtypes(['bool']).columns
data[bool_cols] = data[bool_cols].astype(int)
univariate_numerical(data)
data.to_csv('df7_MODELING_DATA.csv', index=False) # Save a copy of data ready for modeling
data.columns
Index(['price', 'rooms', 'bathroom', 'lift', 'terrace', 'square_meters',
'square_meters_price', 'real_state_apartment', 'real_state_attic',
'real_state_study', 'neighborhood_Ciutat Vella', 'neighborhood_Gràcia',
'neighborhood_Horta- Guinardo', 'neighborhood_Les Corts',
'neighborhood_Nou Barris', 'neighborhood_Sant Andreu',
'neighborhood_Sant Martí', 'neighborhood_Sants-Montjuïc',
'neighborhood_Sarria-Sant Gervasi'],
dtype='object')
# Drop 'square_meters_price' from features
X = data.drop(columns=['price','square_meters_price'])
y = data['price']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature engineering
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)
# Stacking Regressor
base_models = [
('ridge', Ridge(alpha=1.0)),
('lasso', Lasso(alpha=0.1)),
('svr', SVR(kernel='rbf'))
]
meta_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
stacking_model = StackingRegressor(estimators=base_models, final_estimator=meta_model)
# Cross-validation
cv_scores = cross_val_score(stacking_model, X_train_scaled, y_train, cv=5, scoring='r2')
print("Mean R2 Score from Cross Validation:", np.mean(cv_scores))
# Train and Evaluate
stacking_model.fit(X_train_scaled, y_train)
y_pred = stacking_model.predict(X_test_scaled)
Mean R2 Score from Cross Validation: 0.5837427472780874
- Polynomial features are derived by raising existing numerical features to a power (e.g., squared, cubed) and creating interaction terms between different features. This extends the linear model to capture non-linear relationships in the data.
- Adding polynomial and interaction terms can help the model learn more complex relationships between features, improving performance.
- If housing prices are influenced by complex interactions between features like square meters, number of rooms, and location, a linear model might fail to capture these nuances.
- Stacking is an ensemble learning technique that combines multiple base models to make better predictions. It works in two main stages: a) Train base models independently: Several regressors (e.g., Random Forest, XGBoost, LightGBM) make individual predictions. b) Meta-model learns from base model outputs: A final estimator (often a linear model or another tree-based model) takes the predictions from the base models as inputs and learns to optimize the final prediction
- While individual models may overfit, the stacking regressor generalizes better by learning which model performs best in different scenarios
- The achieved R² Score of 0.58 is still low.
- A final approach will be tested, considering feature engineering enhancements such as log transformation for skewed data and polynomial & interaction features; stacking will be applied again, testing other base models.
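The two stacking stages described above can be made explicit by building the meta-features by hand with out-of-fold predictions, which is essentially what `StackingRegressor` does internally. A minimal illustration on synthetic data (a sketch, not the notebook's implementation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=120)

# Stage a) train base models independently; collect out-of-fold predictions
base = [Ridge(alpha=1.0), SVR(kernel="rbf")]
meta_features = np.column_stack(
    [cross_val_predict(m, X, y, cv=5) for m in base]
)

# Stage b) the meta-model learns from the base models' outputs
meta = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
meta.fit(meta_features, y)
```

Using out-of-fold predictions (rather than in-sample ones) keeps the meta-model from simply rewarding whichever base model overfits the most.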
Final Modeling¶
target = "price"
features = [col for col in data.columns if col not in [target, "square_meters_price"]]
X = data[features]
y = data[target]
X.columns
Index(['rooms', 'bathroom', 'lift', 'terrace', 'square_meters',
'real_state_apartment', 'real_state_attic', 'real_state_study',
'neighborhood_Ciutat Vella', 'neighborhood_Gràcia',
'neighborhood_Horta- Guinardo', 'neighborhood_Les Corts',
'neighborhood_Nou Barris', 'neighborhood_Sant Andreu',
'neighborhood_Sant Martí', 'neighborhood_Sants-Montjuïc',
'neighborhood_Sarria-Sant Gervasi'],
dtype='object')
# Apply Log Transformation to Reduce Skewness
y = np.log1p(y)
# Create Polynomial & Interaction Features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
# Standardize Features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Define Base Models
rf = RandomForestRegressor(n_estimators=300, max_depth=20, min_samples_split=5, random_state=42)
xgbr = xgb.XGBRegressor(n_estimators=300, max_depth=10, learning_rate=0.05, random_state=42)
lgbr = lgb.LGBMRegressor(n_estimators=300, max_depth=10, learning_rate=0.05, random_state=42, verbose=-1)
# Stacking Model
stacked_model = StackingRegressor(
estimators=[("rf", rf), ("xgb", xgbr), ("lgb", lgbr)],
final_estimator=xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
)
# Train Model
stacked_model.fit(X_train, y_train)
# Evaluate Model
r2_score = stacked_model.score(X_test, y_test)
print(f"Improved R² Score: {r2_score:.4f}")
Improved R² Score: 0.6504
vif_series = pd.Series(
[variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
index=X.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection:

rooms                               11.126086
bathroom                            13.322090
lift                                 3.227964
terrace                              1.382929
square_meters                       15.947385
real_state_apartment                 1.135441
real_state_attic                     1.093636
real_state_study                     1.023852
neighborhood_Ciutat Vella            1.354277
neighborhood_Gràcia                  1.192324
neighborhood_Horta- Guinardo         1.095892
neighborhood_Les Corts               1.167469
neighborhood_Nou Barris              1.029682
neighborhood_Sant Andreu             1.059112
neighborhood_Sant Martí              1.192783
neighborhood_Sants-Montjuïc          1.162775
neighborhood_Sarria-Sant Gervasi     1.452696
dtype: float64
- Applied log transformation to reduce skewness. This transformation helps normalize right-skewed distributions, making the data more symmetrical and better suited for linear models.
- The dataset's 'price' distribution (as seen in the histogram) is highly skewed, and many machine learning models (like linear regression and tree-based models) perform better with normally distributed data.
- Applying the log transformation reduces the effect of extreme values (e.g., luxury properties with abnormally high prices) and improves the model's ability to capture general trends.
- Decision trees (Random Forest) and gradient boosting (XGBoost, LightGBM) each have unique strengths in handling structured data. Stacking allows leveraging multiple perspectives.
- With this implementation, an improved R² Score of 0.65 was achieved.
- New tests will be run removing high-VIF variables and varying the models.
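One caveat worth noting (my addition, not stated in the notebook): with `y = np.log1p(y)`, the R² above is measured on the log scale. To report errors in the original price units, predictions must be inverted with `np.expm1`, the inverse of `np.log1p`. A minimal sketch on synthetic, right-skewed data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
sqm = rng.uniform(30, 200, size=200)
price = sqm * 3000 * rng.lognormal(0.0, 0.2, size=200)  # right-skewed prices

X = sqm.reshape(-1, 1)
y_log = np.log1p(price)                  # same target transform as the notebook
model = LinearRegression().fit(X, y_log)

pred_price = np.expm1(model.predict(X))  # back to the original price scale
print(f"MAE in original units: {mean_absolute_error(price, pred_price):.0f}")
```

The same `np.expm1` step would apply to the stacked model's predictions before comparing them to actual prices.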
#Data load
inputs=pd.read_csv('df3_FATURE ENGINEERED_DATA.csv') # Original data, without missing values and feature "Unnamed"
data=inputs.copy()
# Outlier handling (drop values >3 standard deviations)
data = data[(np.abs(data.select_dtypes(include=np.number).apply(zscore)) < 3).all(axis=1)]
# Create dummy variables for categorical features with specified baseline categories
data = pd.get_dummies(data, columns=['real_state', 'neighborhood'], drop_first=False)
for feature, baseline in {'real_state': "flat", 'neighborhood': "Eixample"}.items():
if f"{feature}_{baseline}" in data.columns:
data.drop(columns=[f"{feature}_{baseline}"], inplace=True)
# Convert boolean columns to numeric (0 and 1)
bool_cols = data.select_dtypes(['bool']).columns
data[bool_cols] = data[bool_cols].astype(int)
target = "price"
outofmodel = ["square_meters_price", "rooms","bathroom"]
exclude_cols = [target] + outofmodel
features = [col for col in data.columns if col not in exclude_cols]
X = data[features]
y = data[target]
print(f'Original data features: {list(inputs.columns)}')
print(f'Modeling data features: {list(data.columns)}')
print(f'Output variable: {target}')
print(f'Input variables: {list(X.columns)}')
Original data features: ['price', 'rooms', 'bathroom', 'lift', 'terrace', 'square_meters', 'real_state', 'neighborhood', 'square_meters_price']
Modeling data features: ['price', 'rooms', 'bathroom', 'lift', 'terrace', 'square_meters', 'square_meters_price', 'real_state_apartment', 'real_state_attic', 'real_state_study', 'neighborhood_Ciutat Vella', 'neighborhood_Gràcia', 'neighborhood_Horta- Guinardo', 'neighborhood_Les Corts', 'neighborhood_Nou Barris', 'neighborhood_Sant Andreu', 'neighborhood_Sant Martí', 'neighborhood_Sants-Montjuïc', 'neighborhood_Sarria-Sant Gervasi']
Output variable: price
Input variables: ['lift', 'terrace', 'square_meters', 'real_state_apartment', 'real_state_attic', 'real_state_study', 'neighborhood_Ciutat Vella', 'neighborhood_Gràcia', 'neighborhood_Horta- Guinardo', 'neighborhood_Les Corts', 'neighborhood_Nou Barris', 'neighborhood_Sant Andreu', 'neighborhood_Sant Martí', 'neighborhood_Sants-Montjuïc', 'neighborhood_Sarria-Sant Gervasi']
%%time
# Apply Log Transformation to Reduce Skewness
y = np.log1p(y)
# Create Polynomial & Interaction Features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
# Standardize Features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Define Base Models
base_2 = [
('GradientBoosting_Tuned_2', GradientBoostingRegressor(min_samples_leaf=4, min_samples_split=10, random_state=42, subsample=0.8)),
('GradientBoosting_Optima', GradientBoostingRegressor(learning_rate=0.013201142547601463, max_depth=4, min_samples_leaf=7, min_samples_split=13, n_estimators=532, subsample=0.6777821390980873)),
('XGBoost_Tuned_Optima', xgb.XGBRegressor(colsample_bytree=0.7494749285293014, enable_categorical=False, gamma=0.9450617049891935, learning_rate=0.025181939608234893, max_depth=4, min_child_weight=2, n_estimators=403))
]
# Define Meta Model
meta_2 = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
# Stacking Models
stacked_model_2 = StackingRegressor(estimators=base_2, final_estimator=meta_2)
# Train Model
stacked_model_2.fit(X_train, y_train)
# Evaluate Model
r2_score = stacked_model_2.score(X_test, y_test)
print(f"Improved R² Score: {r2_score:.4f}")
Improved R² Score: 0.6118
CPU times: total: 1min 59s
Wall time: 1min 49s
- Without the high-VIF variable "square_meters", the R² decreased.
- Without the high-VIF variable "rooms", the R² decreased.
- Without the high-VIF variable "bathroom", the R² decreased.
- Without the high-VIF variables "rooms" and "bathroom", the R² decreased.
- Decided not to continue searching for model enhancements due to doubts about the original dataset's quality.
- The best model achieved an R² Score of 0.65.
- The best model excludes the variable "square_meters_price" from modeling.
- The best model's process considers polynomial features, log transformation, and stacking.
Evaluation Consolidated Notes¶
Regression Models
- Models to be tested are : Linear Regression, Lasso Regression, Ridge Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and Support Vector Regressor
- Performance Metrics:
- MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
- MSE (Mean Squared Error): Measures the average of the squares of the errors, giving more weight to larger errors.
- RMSE (Root Mean Squared Error): The square root of MSE, providing error in the same units as the target variable.
- R2 Score (Coefficient of Determination): Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
- Random Forest metrics: Lowest MAE, lowest RMSE, and highest R².
- Random Forest is the best performer overall, indicating strong predictive accuracy and low error.
- Decision Tree metrics: Moderate errors with a good R².
- Decision Tree is a strong candidate, although slightly behind Random Forest.
- Ridge, Linear, and Lasso Regression metrics are consistent with each other, but their performance is noticeably lower than the tree-based methods. They might not be ideal for further tuning if the goal is the best predictive performance.
- For hyperparameter tuning and further validation, Random Forest and Decision Tree stand out as the best candidates due to their superior performance metrics.
- While the linear models (Ridge, Linear, and Lasso) can serve as strong baselines, they do not match the predictive accuracy of the tree-based models.
- K-Nearest Neighbors and SVR appear less promising for further development on this dataset.
Feature Engineering
- From the feature importance plot, square_meters is the most significant variable, followed by square_meters_price.
- Since price is directly derived from square_meters * square_meters_price, including both may not add new information and could introduce redundancy.
- It makes no sense to ask the end user for square_meters and square_meters_price in order to "predict" price.
- NEW MODELS will be evaluated, with the feature square_meters_price DROPPED from the data
- Although its VIF (1.568) is low (suggesting no strong collinearity within the dataset), the mathematical dependence between square_meters and square_meters_price suggests redundancy.
- This means the model could overestimate the importance of one feature over another and lead to unstable coefficient estimates.
- By keeping only square_meters, the model remains more interpretable, focusing on how space affects price rather than a derived variable.
- Noted that the features 'rooms' and 'bathroom' present high multicollinearity and will also be dropped from modeling
- Defined function "preprocess_data(data, target_feature, drop_features, scale_features, test_size=0.30, random_state=1)", to iterate on the data preparation for modeling
- Data preparation dropping the feature square_meters_price
- Linear Regression and Ridge Regression performed the best in terms of R² Score.
- Feature selection will be performed to reduce multicollinearity.
- Data preparation dropping the feature 'rooms' due to high multicollinearity
- After removing the feature 'rooms', Linear Regression and Ridge Regression still performed best in terms of R² Score, but features with high multicollinearity remain
- Data preparation dropping the feature 'bathroom' due to high multicollinearity
- The feature real_state_flat remains with VIF > 5
- Since "flat" is the most frequent category across neighborhoods, it might be highly correlated with certain neighborhood variables.
- Instead of removing real_state_flat, it will be considered as the Baseline Category for real_state
- Modified preprocess_data function to control one-hot encoding category to drop
- Selected real_state_flat and neighborhood_Eixample as the baseline categories for one-hot encoding
- There is no multicollinearity left in the data, suggesting that a property's distribution of rooms and bathrooms is not as relevant as its area, type, and neighborhood
- Linear Regression and Ridge Regression are the best models among those tested, but R² scores suggest that the models are not explaining a significant portion of the variance in the target variable.
- More advanced models will be included in the evaluation
Advanced Regression Models
- Models to be tested are: DecisionTree_Tuned_1, RandomForest_Tuned_1, GradientBoosting_Tuned_1, XGBoost_Tuned_1, LightGBM_Tuned_1, NeuralNetwork(MLP)
- The best R2 score from the advanced models is currently 0.5450 with the Gradient Boosting model.
- Improving from 0.5113 Linear Regression could be a good start, but could potentially be improved further with model tuning
**Model Tuning**
- After model tuning, another slight improvement was achieved.
- The best R² Score among the tuned models is currently 0.5510, with the XGBoost_Tuned_Optima model.
- The best R² Score achieved is still low, suggesting the models are not explaining a significant portion of the variance in the target variable.
Revisit
- So far, various regression models were tested (Linear, Lasso, Ridge, Decision Tree, Random Forest, KNN).
- Feature selection and transformation steps were applied.
- Hyperparameter tuning was performed for some models (Random Forest, Gradient Boosting, XGBoost, LightGBM).
- Optuna was used for advanced hyperparameter tuning.
- Despite all this, the R² score remained low.
- In this section, a different approach will be tested, aiming for better results.
- In this new approach, not all outliers will be removed.
- Only values more than 3 standard deviations from the mean will be removed.
- The remaining data will be treated as valid modeling information, since it captures high-end areas or luxury units.
- Feature selection done considering "real_state_flat" and "neighborhood_Eixample" as the baseline categories for one-hot encoding
- Polynomial features are derived by raising existing numerical features to a power (e.g., squared, cubed) and creating interaction terms between different features. This extends the linear model to capture non-linear relationships in the data.
- Adding polynomial and interaction terms can help the model learn more complex relationships between features, improving performance.
- If housing prices are influenced by complex interactions between features like square meters, number of rooms, and location, a linear model might fail to capture these nuances.
- Stacking is an ensemble learning technique that combines multiple base models to make better predictions. It works in two main stages: a) Train base models independently: Several regressors (e.g., Random Forest, XGBoost, LightGBM) make individual predictions. b) Meta-model learns from base model outputs: A final estimator (often a linear model or another tree-based model) takes the predictions from the base models as inputs and learns to optimize the final prediction
- While individual models may overfit, the stacking regressor generalizes better by learning which model performs best in different scenarios
- The achieved R² Score of 0.58 is still low.
- A final approach will be tested, considering feature engineering enhancements such as log transformation for skewed data and polynomial & interaction features; stacking will be applied again, testing other base models.
Final Modeling
- Applied log transformation to reduce skewness. This transformation helps normalize right-skewed distributions, making the data more symmetrical and better suited for linear models.
- The dataset's 'price' distribution (as seen in the histogram) is highly skewed, and many machine learning models (like linear regression and tree-based models) perform better with normally distributed data.
- Applying the log transformation reduces the effect of extreme values (e.g., luxury properties with abnormally high prices) and improves the model's ability to capture general trends.
- Decision trees (Random Forest) and gradient boosting (XGBoost, LightGBM) each have unique strengths in handling structured data. Stacking allows leveraging multiple perspectives.
- With this implementation, an improved R² Score of 0.65 was achieved.
- New tests were run removing high-VIF variables and varying the models.
- Without the high-VIF variable "square_meters", the R² decreased.
- Without the high-VIF variable "rooms", the R² decreased.
- Without the high-VIF variable "bathroom", the R² decreased.
- Without the high-VIF variables "rooms" and "bathroom", the R² decreased.
- Decided not to continue searching for model enhancements due to doubts about the original dataset's quality.
- The best model achieved an R² Score of 0.65.
- The best model excludes the variable "square_meters_price" from modeling.
- The best model's process considers polynomial features, log transformation, and stacking.
7. Deployment¶
Implementing the model in a production environment, making it accessible for real-world use. This might involve integrating the model with existing systems or deploying it via APIs or cloud platforms.
#Data load
inputs=pd.read_csv('df3_FATURE ENGINEERED_DATA.csv') # Original data, without missing values and feature "Unnamed"
data=inputs.copy()
# Outlier handling (drop values >3 standard deviations)
data = data[(np.abs(data.select_dtypes(include=np.number).apply(zscore)) < 3).all(axis=1)]
# Create dummy variables for categorical features with specified baseline categories
data = pd.get_dummies(data, columns=['real_state', 'neighborhood'], drop_first=False)
for feature, baseline in {'real_state': "flat", 'neighborhood': "Eixample"}.items():
if f"{feature}_{baseline}" in data.columns:
data.drop(columns=[f"{feature}_{baseline}"], inplace=True)
# Convert boolean columns to numeric (0 and 1)
bool_cols = data.select_dtypes(['bool']).columns
data[bool_cols] = data[bool_cols].astype(int)
target = "price"
outofmodel = ["square_meters_price"]
exclude_cols = [target] + outofmodel
features = [col for col in data.columns if col not in exclude_cols]
X = data[features]
y = data[target]
print(f'Original data features: {list(inputs.columns)}')
print(f'Modeling data features: {list(data.columns)}')
print(f'Output variable: {target}')
print(f'Input variables: {list(X.columns)}')
Original data features: ['price', 'rooms', 'bathroom', 'lift', 'terrace', 'square_meters', 'real_state', 'neighborhood', 'square_meters_price']
Modeling data features: ['price', 'rooms', 'bathroom', 'lift', 'terrace', 'square_meters', 'square_meters_price', 'real_state_apartment', 'real_state_attic', 'real_state_study', 'neighborhood_Ciutat Vella', 'neighborhood_Gràcia', 'neighborhood_Horta- Guinardo', 'neighborhood_Les Corts', 'neighborhood_Nou Barris', 'neighborhood_Sant Andreu', 'neighborhood_Sant Martí', 'neighborhood_Sants-Montjuïc', 'neighborhood_Sarria-Sant Gervasi']
Output variable: price
Input variables: ['rooms', 'bathroom', 'lift', 'terrace', 'square_meters', 'real_state_apartment', 'real_state_attic', 'real_state_study', 'neighborhood_Ciutat Vella', 'neighborhood_Gràcia', 'neighborhood_Horta- Guinardo', 'neighborhood_Les Corts', 'neighborhood_Nou Barris', 'neighborhood_Sant Andreu', 'neighborhood_Sant Martí', 'neighborhood_Sants-Montjuïc', 'neighborhood_Sarria-Sant Gervasi']
%%time
# Apply Log Transformation to Reduce Skewness
y = np.log1p(y)
# Create Polynomial & Interaction Features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
# Standardize Features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Define Base Models
base_final = [
('GradientBoosting_Tuned_2', GradientBoostingRegressor(min_samples_leaf=4, min_samples_split=10, random_state=42, subsample=0.8)),
('GradientBoosting_Optima', GradientBoostingRegressor(learning_rate=0.013201142547601463, max_depth=4, min_samples_leaf=7, min_samples_split=13, n_estimators=532, subsample=0.6777821390980873)),
('XGBoost_Tuned_Optima', xgb.XGBRegressor(colsample_bytree=0.7494749285293014, enable_categorical=False, gamma=0.9450617049891935, learning_rate=0.025181939608234893, max_depth=4, min_child_weight=2, n_estimators=403))
]
# Define Meta Model
meta_final = xgb.XGBRegressor(colsample_bytree=0.7494749285293014, enable_categorical=False, gamma=0.9450617049891935, learning_rate=0.025181939608234893, max_depth=4, min_child_weight=2, n_estimators=403)
# Stacking Models
stacked_model_final = StackingRegressor(estimators=base_final, final_estimator=meta_final)
# Train Model
stacked_model_final.fit(X_train, y_train)
# Evaluate Model
r2_score = stacked_model_final.score(X_test, y_test)
print(f"Improved R² Score: {r2_score:.4f}")
Improved R² Score: 0.6584
CPU times: total: 3min 20s
Wall time: 3min 24s
# Create models directory if it doesn't exist
os.makedirs("models", exist_ok=True)
# Get current date
current_date = datetime.datetime.now().strftime("%Y-%m-%d")
# Export the best model
file_name = f"models/stacked_model_at_{current_date}.pkl"
joblib.dump(stacked_model_final, file_name)
# Export feature transformers
scaler_name = f"models/scaler_at_{current_date}.pkl"
joblib.dump(scaler, scaler_name)
poly_name = f"models/poly_at_{current_date}.pkl"
joblib.dump(poly, poly_name)
print("Models saved successfully!")
Models saved successfully!
- The best model achieved and the related feature transformation files were saved
- Created a folder named "models" for saving model files
- Files saved with a filename that includes today's date, for versioning purposes
- Model deployment code will be held in a separate Jupyter Notebook using Streamlit
- Streamlit is a Python framework for building interactive web applications for machine learning models.
- The new file for the user interface will be named "PROJECT2_UI.py" and will load the model and build a user interface using Streamlit
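As a sanity check of the persistence step above, the full transform-plus-model chain can be dumped and reloaded the way PROJECT2_UI.py would use it. The sketch below uses stand-in objects and a temporary path (the real artifacts live in the "models" folder):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Stand-in training data; the real chain is poly -> scaler -> stacked model.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

poly_demo = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
scaler_demo = StandardScaler()
Xp = scaler_demo.fit_transform(poly_demo.fit_transform(X_demo))
model_demo = LinearRegression().fit(Xp, y_demo)

# Persist and reload the full chain, as the UI script would.
path = os.path.join(tempfile.gettempdir(), "stacked_model_demo.pkl")
joblib.dump((poly_demo, scaler_demo, model_demo), path)
poly_l, scaler_l, model_l = joblib.load(path)

new_row = np.array([[0.1, -0.2, 0.3]])
pred = model_l.predict(scaler_l.transform(poly_l.transform(new_row)))
# For the final notebook model the target was log1p-transformed, so the UI
# would additionally apply np.expm1(pred) to get a price in the original units.
```

Reloading the transformers alongside the model matters: applying the model to raw, untransformed inputs would silently produce wrong predictions.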
8. Monitoring and Maintenance¶
Continuously monitoring the model's performance in production to ensure its accuracy and relevance over time. This stage may also involve retraining the model as new data becomes available.
- Summarized code for model testing on new data, when available
- New data should contain no missing values before modeling, and must maintain the feature structure and naming
- A separate file named "model_retrain.py" was created to run the same modeling when new data becomes available
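The monitoring idea above can be sketched as a simple trigger: score the deployed model on newly collected data and flag it for retraining when performance degrades. The threshold and helper name below are assumptions for illustration, not part of model_retrain.py:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

R2_THRESHOLD = 0.60  # assumed acceptance level, not stated in the notebook

def needs_retrain(model, X_new, y_new, threshold=R2_THRESHOLD):
    """Flag the deployed model for retraining when its R2 on fresh data drops."""
    return bool(r2_score(y_new, model.predict(X_new)) < threshold)

# Toy illustration: a model fitted on one relationship scores poorly when it shifts.
rng = np.random.default_rng(7)
X_old = rng.normal(size=(100, 2))
y_old = X_old @ np.array([1.0, 1.0])
model = LinearRegression().fit(X_old, y_old)

X_new = rng.normal(size=(100, 2))
y_shifted = X_new @ np.array([-1.0, 1.0])  # the relationship has changed
```

In production this check would run on each new batch, with retraining delegated to model_retrain.py when the flag fires.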
9. Communication and Reporting¶
Presenting findings and results to stakeholders in a clear and actionable manner, often through dashboards, visualizations, or reports.
- Data quality is fundamental to obtaining optimal results.
- Considering the data quality issues (missing values and outliers) and limitations (only 10 neighborhoods), it is suggested to evaluate automated methods of data collection (e.g., web scraping) for better quality and broader coverage.
- The variable "square_meters_price" was not considered for modeling, due to the assumption that no modeling is required to predict "price" if that variable is known together with "square_meters".
- To enable the model to learn the complex relationships between features on this dataset, modeling considered:
- Feature Engineering Enhancements
- Log Transformation for Skewed Data
- Polynomial & Interaction Features
- Modeling with XGBoost, LightGBM, and Stacking
- To interact with users, a dedicated script (PROJECT2_UI.py) was created that runs a local web app where users can input values and get predictions.
- It is important to monitor the model's performance. If the model needs retraining, this can be done with a dedicated script (model_retrain.py).
- For sharing and visualization, an HTML version of this code is made available.
!jupyter nbconvert --to html PROJECT2_PYTHON.ipynb
C:\Users\otroc\AppData\Local\Programs\Python\Python313\Scripts\jupyter-nbconvert.EXE\__main__.py:4: DeprecationWarning: Parsing dates involving a day of month without a year specified is ambiguious and fails to parse leap day. The default behavior will change in Python 3.15 to either always raise an exception or to use a different default year (TBD). To avoid trouble, add a specific year to the input & format. See https://github.com/python/cpython/issues/70647.
[NbConvertApp] Converting notebook PROJECT2_PYTHON.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 19 image(s).
[NbConvertApp] Writing 4925911 bytes to PROJECT2_PYTHON.html